Sigma delta quantization for images

ABSTRACT

A technique is presented for quantizing pixels of an image using Sigma Delta quantization. In one aspect, pixel values for an image are segmented into columns of pixel values; and for each column in the matrix, pixel values of a given column are quantized using sigma delta modulation. The pixel values in a given column are preferably quantized as a whole, thereby minimizing accumulated quantization error from a starting pixel value in the given column to a current pixel value in the given column. In another aspect, the pixels of an image are quantized using a 2D generalization of Sigma Delta modulation.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/024,861, filed on May 14, 2020. The entire disclosure of the above application is incorporated herein by reference.

GOVERNMENT CLAUSE

This invention was made with government support under CCF1909523 awarded by the National Science Foundation. The government has certain rights in the invention.

FIELD

The present disclosure relates to techniques for quantizing pixels of an image.

BACKGROUND

In digital signal processing, quantization is the step of converting a signal's real-valued samples into a finite string of bits. As the first step in digital processing, it plays a crucial role in determining the information conversion rate and the reconstruction quality. Mathematically, given a signal class 𝒮 ⊆ ℝ^(N) and a fixed codebook 𝒞, the goal of quantization is to find for every signal x in 𝒮 a codebook representation q ∈ 𝒞 so that it can be stored digitally. Q is used to denote the quantization map between the signal space 𝒮 and the codebook 𝒞:

Q: 𝒮 → 𝒞, x → q.

Quantization schemes are usually equipped with reconstruction algorithms that can reconstruct the original signal from the encoded bits. To ensure practicability, the reconstruction algorithms have to run in polynomial time. More explicitly, the algorithm, denoted by Δ, should be able to reconstruct every signal x ∈ 𝒮 from its encoded vector q in polynomial time up to some small distortion

Distortion := ∥x̂ − x∥₂ ≡ ∥Δ(q) − x∥₂.

For a given signal class 𝒮, define the optimal quantization Q̂ to be the one optimizing the bit rate distortion, defined as the minimal possible distortion under a fixed bit budget. Let R be the fixed budget. Among all codebooks 𝒞 representable in R bits, the optimal quantization Q̂ is the one that minimizes the minimax distortion

$\hat{Q} = \arg\min_{Q} \min_{\Delta \in \mathcal{D}} \max_{x \in \mathcal{S}} \| \Delta \circ Q(x) - x \|_{2},$

where 𝒟 is the class of polynomial time decoders.

If the signal class 𝒮 forms a compact metric space, one can find the optimal quantizer using an information theoretical argument. Given a fixed approximation error ϵ, one can find infinitely many ϵ-nets of 𝒮. The smallest possible cardinality of the ϵ-nets is called the covering number N(ϵ). Suppose such an optimal ϵ-net is given, and define the quantization Q as the map that sends every point in 𝒮 to the center of the ϵ-ball containing this point. Using the binary representation, the number of bits needed to encode the centers is R = log₂N(ϵ). This relation reduces to ϵ ∼ 2^(−R/d) when 𝒮 is the ℓ₂ ball in ℝ^(d) (see, e.g., [19]), where ∼ means the two sides are equal to each other up to some constant. This relation ϵ ∼ 2^(−R/d) is called the exponential relation, which is the best decay rate for ϵ as R increases. However, this optimal quantization scheme suffers from the following impracticalities: 1) unless 𝒮 has a regular shape, finding the ϵ-covering for 𝒮 suffers from the curse of dimensionality; 2) the scheme cannot be operated in an online manner, as the nearest center of x can only be found after all samples of x have been received; and 3) if extra samples come in, the ϵ-net needs to be recalculated.

These concerns inspire people to impose the following practical requirements on the quantizer Q. The quantization q should have the same size as the signal x. Q should quantize each entry of x to an entry of q in an online manner, which means q_(i) (the ith entry of q) depends only on the historical inputs x₁, . . . , x_(i), not the future ones. The alphabets 𝒜_(i) for each q_(i) are the same and fixed in advance, i.e., 𝒜₁ = . . . = 𝒜_(n) = 𝒜. Together they form the codebook 𝒞 = 𝒜^(n). As the quantization is implemented in analog hardware, the mathematical operations should be kept as simple as possible. In particular, addition and subtraction are much preferred over multiplication and division as circuit operations. Here, for simplicity, the alphabet 𝒜 is chosen from the class of finite equal-spacing grids with step-size δ,

$\begin{matrix}{\mathcal{A}_{\delta} = \{ c + J\delta,\ c \in \mathbb{R},\ J \in \mathbb{Z},\ J_{1} < J < J_{2} \}.} & (1.1)\end{matrix}$

This section provides background information related to the present disclosure which is not necessarily prior art.

SUMMARY

This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.

In one aspect, a technique is presented for quantizing pixels of an image using 1D Sigma Delta quantization. The method includes: receiving pixel values for an image captured by an imager at a sequencing circuit; segmenting or sequencing the pixel values of the matrix into columns of pixel values; and for each column in the matrix, quantizing pixel values of a given column using sigma delta modulation. The pixel values in a given column are preferably quantized as a whole, thereby minimizing accumulated quantization error from a starting pixel value in the given column to a current pixel value in the given column. The quantized values for each column may be assembled into a rectangular array and stored as a final image in a non-transitory computer-readable medium. Alternatively or additionally, a final image may be reconstructed from the quantized pixel values of the matrix using a decoder, where the decoder is configured to minimize an image norm during reconstruction.

In another aspect, a technique is presented for quantizing pixels of an image using a 2D generalization of Sigma Delta modulation. The method includes: receiving pixel values for an image captured by an imager at a sequencing circuit; segmenting the pixel values of the matrix into one or more patches of pixel values, where each patch of pixel values is a subset of pixel values from the matrix arranged in a two dimensional array; and for each patch in the matrix, quantizing the pixel values of a given patch using a two dimensional generalization of sigma delta modulation.

In one embodiment, segmenting the pixel values of the matrix further includes creating a sequence of pixel values for a given patch by sequencing pixel values along anti-diagonals of the two dimensional array, starting from an upper left corner of the given patch and moving to the lower right corner of the given patch. Each pixel value in the sequence for a given patch is then quantized by summing quantization errors associated with at least three pixels neighboring the given pixel and rounding the sum to the nearest member of an alphabet.

Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.

DRAWINGS

The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.

FIG. 1 is a diagram depicting an example imaging system;

FIG. 2 is a graph showing the compression rate of D, D² and D³ on exponential functions with various frequencies;

FIG. 3 is a flowchart depicting a technique for quantizing pixels of an image;

FIG. 4A is a diagram illustrating how to sequence pixel values of an image column wise and in series;

FIG. 4B is a diagram illustrating how to sequence pixel values of an image column wise and in parallel;

FIG. 5 is a diagram illustrating how to sequence pixel values from a patch;

FIGS. 6A-6C are graphs showing a comparison of various 1D signal reconstruction results between the proposed encoder-decoder pairs and MSQ; and

FIGS. 7A and 7B are graphs showing the reconstruction result of signals that satisfy a minimum separation condition.

Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

Example embodiments will now be described more fully with reference to the accompanying drawings.

FIG. 1 depicts an example imaging system 10. The imaging system 10 is comprised generally of an imager 11, a sequencing circuit 12, an analog-to-digital converter (ADC) 13 and a decoder 14. The imager 11 is configured to capture an image of a scene, where the image is represented by pixel values arranged in a matrix. In some embodiments, the imager is further defined as a CCD sensor or CMOS sensor.

Pixel values for the image are then quantized by the imaging system 10. The sequencing circuit 12 is configured to receive the pixel values for the image from the imager 11 and operates to segment and/or sequence the pixel values of the matrix into columns, rows, or patches as will be further described below. The analog-to-digital converter 13 is interfaced with the sequencing circuit 12 and quantizes the pixel values, for example using sigma delta modulation. In one embodiment, the quantized pixel values are assembled into a rectangular array and the array is stored as a final image in a non-transitory computer-readable memory. In other embodiments, the quantized pixel values are optionally reconstructed before being stored in the non-transitory computer-readable memory. In this case, the decoder 14 is interfaced with the analog-to-digital converter 13 and reconstructs a final image from the quantized pixel values. It is envisioned that other types of signal processing, such as JPEG compression, contrast adjustment, etc., may be applied to the quantized pixel values as well.

In one example, the imaging system 10 is implemented as part of a camera. It is to be understood that only the relevant components of the imaging system 10 are discussed in relation to FIG. 1, but that other components may be needed to control and manage the overall operation of the system.

By way of background, the existing quantization schemes are first introduced in the context of image quantization. Let X ∈ ℝ^(N,N) store the pixel values of a grayscale image. Without any ambiguity, the image is referred to as matrix X.

For Memoryless Scalar Quantization (MSQ), suppose the alphabet 𝒜 = 𝒜_(δ) is defined as in (1.1). The scalar quantization Q_(𝒜): ℝ → 𝒜 quantizes a given scalar by rounding it off to the nearest element in the alphabet,

$Q_{\mathcal{A}}(z) \in \arg\min_{v \in \mathcal{A}} | v - z |.$

The Memoryless Scalar Quantization (MSQ) applies scalar quantization to each sample of the input sequence independently. In terms of image quantization, for a given image X ∈ ℝ^(N,N), MSQ on X means quantizing each pixel independently,

q = Q_(MSQ)(X), with q_(i,j) = Q_(𝒜)(X_(i,j)).

Here q_(i,j) and X_(i,j) are the (i, j)th entries of the quantized and the original images q and X, respectively.
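The grid alphabet of (1.1) and the MSQ rule can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed circuit: the helper names `uniform_alphabet`, `scalar_quantize` and `msq`, the tie-breaking rule, and the sample values are all assumptions made for the example.

```python
from typing import List, Sequence


def uniform_alphabet(c: float, delta: float, j1: int, j2: int) -> List[float]:
    """Finite equal-spacing grid {c + J*delta : j1 < J < j2}, as in (1.1)."""
    return [c + j * delta for j in range(j1 + 1, j2)]


def scalar_quantize(z: float, alphabet: Sequence[float]) -> float:
    """Round z to the nearest alphabet element (assumed tie rule: smaller element)."""
    return min(alphabet, key=lambda v: (abs(v - z), v))


def msq(image: Sequence[Sequence[float]], alphabet: Sequence[float]) -> List[List[float]]:
    """Memoryless scalar quantization: every pixel is rounded independently."""
    return [[scalar_quantize(pixel, alphabet) for pixel in row] for row in image]


# Hypothetical example: step-size 0.25 on [0, 1] and a 2x2 "image".
alphabet = uniform_alphabet(c=0.0, delta=0.25, j1=-1, j2=5)
image = [[0.10, 0.40], [0.62, 0.90]]
quantized = msq(image, alphabet)
```

Because each pixel is handled in isolation, no state is carried from one pixel to the next, which is the property the Sigma Delta schemes described next give up in exchange for error feedback.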

Sigma delta quantization is an adaptive quantization scheme proven to be more efficient than MSQ in a variety of applications. The adaptiveness comes from the fact that it utilizes quantization errors of previous samples to increase the accuracy of the current sample. Suppose the sample sequence is y = (y₁, . . . , y_(m)). The first order ΣΔ quantization q = Q_(ΣΔ)(y) is obtained by running the following iterations

$\begin{matrix}{q_{i} = Q_{\mathcal{A}}( y_{i} + u_{i - 1} ), \quad ( Du )_{i} := u_{i} - u_{i - 1} = y_{i} - q_{i}.} & (1.2)\end{matrix}$

One can see from the first equation that one quantizes the i^(th) input y_(i) by first adding to it the historical errors stored in the so-called state variable u_(i−1), then applying the scalar quantization to the sum. In the second equation, D is the forward finite difference operator/matrix, with 1s on the diagonal and −1s on the sub-diagonal. Hence, the second equation defines a recurrence relation allowing one to update the state variable u_(i). The scheme defined in equation (1.2) is the so-called first order quantization scheme because it only uses one step of the historical error, u_(i−1). More generally, one can define the r^(th) order ΣΔ quantization, denoted by q = Q_(ΣΔ)^(r)(y), by involving r steps of historical errors, u_(i−1), u_(i−2), . . . , u_(i−r). More precisely, each entry q_(i) of q is obtained by

$q_{i} = Q_{\mathcal{A}}( \rho_{r}( u_{i - 1}, \ldots, u_{i - r} ) + y_{i} ), \quad ( D^{r}u )_{i} := y_{i} - q_{i},$

where ρ_(r) is some general function aggregating the accumulated errors u_(i−1), . . . , u_(i−r). The r^(th) order finite difference operator is defined via D^(r)u := D(D^(r−1)u).
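The first order iteration (1.2) can be sketched as follows. This is an illustrative sketch only: the function names, the initial state u₀ = 0, and the sample alphabet and input are assumptions chosen for the example, not values taken from the disclosure.

```python
from typing import List, Sequence, Tuple


def scalar_quantize(z: float, alphabet: Sequence[float]) -> float:
    """Round z to the nearest alphabet element."""
    return min(alphabet, key=lambda v: abs(v - z))


def sigma_delta_first_order(
    y: Sequence[float], alphabet: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Run q_i = Q(y_i + u_{i-1}), u_i = u_{i-1} + y_i - q_i; return (q, u)."""
    q: List[float] = []
    u: List[float] = []
    state = 0.0  # assumed initial state u_0 = 0
    for sample in y:
        q_i = scalar_quantize(sample + state, alphabet)
        state = state + sample - q_i  # accumulated quantization error
        q.append(q_i)
        u.append(state)
    return q, u


# Hypothetical example: constant input 0.3 on a grid with step-size 0.5.
delta = 0.5
alphabet = [-0.5, 0.0, 0.5, 1.0, 1.5]
y = [0.3, 0.3, 0.3, 0.3, 0.3]
q, u = sigma_delta_first_order(y, alphabet)
```

In this toy run the output alternates between 0.5 and 0.0, so its running average tracks the input level 0.3, whereas rounding every sample independently would return 0.5 each time. The state variable stays within δ/2 in magnitude throughout.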

The main advantage of Sigma Delta quantization over MSQ is its adaptive usage of the feedback information. The feedback information helps the quantizer to efficiently use the given bits to maximally store information from special types of signals, such as those with low frequencies. Mathematically, these adaptive quantizers achieve a high information conversion rate through the noise-shaping effect on the quantization error, which means the errors of such quantizers are distributed non-uniformly. When inputting the same random sequence, the error spectrum of MSQ is uniform, while that of Sigma Delta quantization is (nearly) linear. This property has the following theoretical explanation. First, the matrix form of definition (1.2) gives an expression for the first order Sigma Delta quantization error y − q,

y − q = Du, ∥u∥_(∞) ≤ δ/2,

where δ is the quantization step-size in (1.1) and D is the finite difference matrix. This is saying that y − q ∈ D(B_(∥·∥_(∞))(δ/2)): the quantization error y − q lies in the ℓ_(∞) ball of radius δ/2 reshaped by the operator D. In the case of the r^(th) order ΣΔ quantization (r ∈ ℤ₊), one would similarly have

y − q = D^(r)u, ∥u∥_(∞) ≤ δ/2,

which means the quantization error y − q lies in the ℓ_(∞) ball reshaped by the operator D^(r).

The singular values of D^(r) determine the radii of the reshaped ℓ_(∞) ball containing quantization errors. The singular vectors of D lie almost aligned with the Fourier basis, and the singular values of D increase with frequency. Therefore, the low-frequency errors corresponding to smaller singular values of D^(r) would be compressed most. One can numerically verify this unbalanced error reduction effect of D^(r) on various sinusoidal frequencies by computing the ratio

$\rho_{r}(w) = \frac{\| D^{r}e^{- iwt} \|_{2}}{\| e^{- iwt} \|_{2}}.$

From the plot of ρ_(r)(w) in FIG. 2, one can see that the low frequency sinusoids lose more energy after going through D^(r), especially when the quantization order r is large.
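The ratio ρ_(r)(w) can be checked numerically with a short script. The sketch below is an illustration under stated assumptions: the signal is sampled at integer times t = 0, . . . , n − 1, and the boundary entries of each difference are dropped (a "valid" difference) so that edge effects do not mask the trend; the function names and the two test frequencies are hypothetical.

```python
import cmath
import math
from typing import List


def valid_difference(v: List[complex]) -> List[complex]:
    """Finite difference v_i - v_{i-1}, keeping only entries with a predecessor."""
    return [v[i] - v[i - 1] for i in range(1, len(v))]


def l2_norm(v: List[complex]) -> float:
    return math.sqrt(sum(abs(x) ** 2 for x in v))


def energy_ratio(w: float, r: int, n: int = 512) -> float:
    """Approximate rho_r(w) = ||D^r e^{-iwt}||_2 / ||e^{-iwt}||_2 on n samples."""
    signal = [cmath.exp(-1j * w * t) for t in range(n)]
    out = signal
    for _ in range(r):
        out = valid_difference(out)
    return l2_norm(out) / l2_norm(signal)


low, high = 0.05, 3.0  # assumed low and high angular frequencies (radians/sample)
ratios = {r: (energy_ratio(low, r), energy_ratio(high, r)) for r in (1, 2, 3)}
```

Each interior entry of the differenced exponential has magnitude (2 sin(w/2))^r, so the low-frequency ratio shrinks rapidly as r grows while the high-frequency ratio grows, which matches the qualitative behavior described for FIG. 2.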

Since Sigma Delta quantization can keep most error away from the low frequencies, it is ideal for quantizing low-frequency signals. For instance, dense time samples of audio signals can be deemed low-frequency vectors, and they have indeed been shown to be a good application of Sigma Delta quantization. Images, on the other hand, do not consist of only low frequencies, as sharp edges have slowly decaying Fourier coefficients. Therefore, it is not obvious whether applying Sigma Delta quantization to images is beneficial.

Despite the importance of quantization in image acquisition, MSQ is still the state-of-the-art quantizer in commercial cameras. The major drawback of MSQ is that when the bit-depth (i.e., the number of bits used to represent each pixel) is small, it has a color-banding artifact, i.e., different colors merge together to cause fake contours and plateaus in the quantized image. A known technique called dithering reduces color-banding by randomly perturbing the pixel values (e.g., adding random noise) before quantization. It thereby breaks artificial contour patterns into less harmful random noise. However, this random noise is still quite visible, and a more fundamental issue is that dithering only randomizes the quantization error instead of reducing it. The same amount of error still exists in the quantized image and will manifest itself in other ways.

Another method to avoid color-banding is digital halftoning, first proposed in the context of binary printing, where pixel values are converted to 0 or 1 for printing, leading to a possibly severe color-banding artifact. To mitigate it, digital halftoning was proposed based on the ideas of sequential pixel quantization and error diffusion. Error diffusion means the quantization error of the current pixel is spread out to its neighbours to compensate for the overall under/over-shooting. The decay of energy during spreading is set to empirical values that minimize the overall ℓ₂ quantization error of an entire image class. Error diffusion works under a similar assumption as Sigma Delta quantization, namely that the image intensity varies slowly and smoothly. In a sense, it trades color-richness for spatial resolution. Like dithering, error diffusion does not reduce the overall noise but only redistributes it.

From this discussion, one sees that both dithering and digital halftoning only redistribute the quantization error instead of compressing it. In contrast, this disclosure introduces an improved technique for quantizing pixels of an image. Explicitly, suppose N² is the total number of pixels and s is the number of pixels representing curve discontinuities (e.g., edges) in the image; the proposed technique reduces the quantization error from O(N) to O(√s). This is achieved by combining Sigma Delta quantization with an optimization based reconstruction. It was observed in numerical experiments that both the low and high frequency errors are reduced.

FIG. 3 illustrates one embodiment for quantizing pixels of an image in accordance with this disclosure. As a starting point, an image of a scene is captured at 31 by an imager or imaging device. The image is represented by pixel values arranged in a matrix.

When using an r^(th) order Sigma Delta quantization scheme on a 2D image X, pixel values of the image X need to be converted into sequences as indicated at 32. One way to do this is to apply Sigma Delta quantization independently to each column of the image,

q = Q(X), q = [q₁, . . . , q_(N)], q_(j) = Q_(ΣΔ)^(r)(X_(j)), j = 1, . . . , N,

where X_(j) is the j^(th) column of X and Q_(ΣΔ)^(r) is the r^(th) order 1D Sigma Delta quantizer. In other words, the pixel values for each column are quantized as a whole, thereby minimizing accumulated quantization error from a starting pixel value to the currently quantized pixel value in a given column of the image. In one example, the pixel values from the columns of the image are quantized in series (i.e., one column at a time) as shown in FIG. 4A. In another example, the pixel values from the columns of the image are quantized in parallel as shown in FIG. 4B. As opposed to column wise, it is readily understood that pixel values of the image could alternatively be quantized row wise. In other words, the pixel values from a subset of pixels are grouped together and then quantized, where the pixels in the subset of pixels are neighbors to each other in the matrix.
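The column-wise scheme can be sketched as below. This is a minimal sketch under assumptions: it uses the first order scheme (r = 1) with an assumed zero initial state, stores the image as a list of rows, and processes the columns one after another as in the serial arrangement; the function names and sample values are hypothetical.

```python
from typing import List, Sequence


def sigma_delta_1d(column: Sequence[float], alphabet: Sequence[float]) -> List[float]:
    """First order Sigma Delta quantization of one column (assumed u_0 = 0)."""
    state = 0.0
    out: List[float] = []
    for pixel in column:
        q = min(alphabet, key=lambda v: abs(v - (pixel + state)))
        state += pixel - q  # carry the error to the next pixel in the column
        out.append(q)
    return out


def quantize_columns(
    image: Sequence[Sequence[float]], alphabet: Sequence[float]
) -> List[List[float]]:
    """Quantize every column independently and reassemble a rectangular array."""
    n_rows, n_cols = len(image), len(image[0])
    columns = [[image[i][j] for i in range(n_rows)] for j in range(n_cols)]
    q_cols = [sigma_delta_1d(col, alphabet) for col in columns]
    return [[q_cols[j][i] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical 4x2 image with constant columns, quantized on {0, 0.5, 1}.
alphabet = [0.0, 0.5, 1.0]
image = [[0.3, 0.8] for _ in range(4)]
quantized = quantize_columns(image, alphabet)
```

Since no state is shared between columns, the loop over columns could equally be run in parallel, one quantizer per column.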

With continued reference to FIG. 3, the sequence of pixel values is then quantized at 33 using Sigma Delta modulation. In a first embodiment, 1D Sigma Delta quantization is applied to each column of the image. A key question one may ask concerns the practicality of the proposed adaptive quantizers on commercial cameras. A natural concern is the waiting time. Unlike MSQ, which can quantize each pixel in parallel, Sigma Delta quantization can only be performed sequentially, which seems to inevitably introduce extra waiting time. However, this is not the case because current cameras already use sequential quantization architectures for consistency, energy and size considerations. More specifically, in current cameras, to reduce the number of ADCs (analog-to-digital converters) and save energy, the whole image or a column of pixels is assigned to one ADC, which means these pixels need to wait in a queue to be quantized anyway. Thus, to implement the proposed technique, an additional memory unit is added to the circuit of a camera.

Quantized values of the image are assembled into a rectangular array at 35 and stored as a final image in a non-transitory computer-readable medium. In some embodiments, the quantized values of the image may optionally be decoded at 34 using a decoder. The decoder is configured to minimize an image norm (e.g., total variation norm) as will be further described below.

Although one can apply 1D ΣΔ quantization column by column to an image, it is likely to create discontinuities across columns. As images are two-dimensional arrays, a two-dimensional quantization scheme seems more helpful in maintaining continuity along both rows and columns. For a second embodiment, this disclosure proposes a 2D Sigma Delta quantization which can be applied to one or more patches of an image.

In a nutshell, the key property that defines the first order 1D Sigma Delta quantization map Q_(ΣΔ): ℝ^(N) ∋ y → q ∈ 𝒜^(N) (𝒜 is the alphabet) is that there exists a vector u ∈ ℝ^(N), the state variable, such that q and u obey

(A1) (boundedness/stability): ∥u∥_(∞) ≤ C, for some constant C;

(A2) (adaptivity): u_(i) = u_(i−1) + y_(i) − q_(i), ∀i, where u_(i), y_(i), q_(i) are the ith components of the vectors u, y and q, respectively;

(A3) (causality): q only depends on the history of y, that is, q_(i) = f(y_(i), y_(i−1), . . . , y₁), for any i and some function f.

This scheme is extended to two dimensions as described below. The quantization map in 2D is defined as Q: ℝ^(N,N) ∋ y → q ∈ 𝒜^(N,N), the auxiliary variable is u ∈ ℝ^(N,N), and the properties (A1)-(A3) can be changed to

(A1′): ∥u∥_(max) ≤ C (∥·∥_(max) denotes the entry-wise maximum of a matrix)

(A2′): u_(i,j) = u_(i,j−1) + u_(i−1,j) − u_(i−1,j−1) + y_(i,j) − q_(i,j), which has a matrix representation DuD^(T) = y − q, and

(A3′): q_(i,j)=f({y_(i′,j′)}_(i′≤i,j′≤j)).

Provided that the quantization alphabet is large enough, one can show that the u and q that satisfy (A1′)-(A3′) can be constructed through the recursive updates

$\begin{matrix}{q_{i,j} = Q_{\mathcal{A}}( u_{i,j - 1} + u_{i - 1,j} - u_{i - 1,j - 1} + y_{i,j} ),} & (2.1)\end{matrix}$

$u_{i,j} = u_{i,j - 1} + u_{i - 1,j} - u_{i - 1,j - 1} + y_{i,j} - q_{i,j}.$

Note that the first row and column are initialized exactly the same as in the 1D first order sigma delta quantization. With a 1-bit alphabet, there might not exist a pair of u and q obeying (A1′)-(A3′). With a two or more bit alphabet, the following proposition ensures the existence of a stable 2D Sigma Delta quantization.

Proposition 2.1: For a given 2D array y ∈ [a, b]^(N,N) and bit depth d ≥ 2, there exists an alphabet 𝒜 such that u and q generated by (2.1) satisfy (A1′)-(A3′) with

$C = \frac{b - a}{2( 2^{d} - 3 )}.$

The proof of this proposition is as follows. Without loss of generality, assume that a ≤ y_(i,j) ≤ b for all 1 ≤ i, j ≤ N, with some constants a ≤ b. Let

$C = \frac{b - a}{2( 2^{d} - 3 )},$

and create the alphabet as

𝒜 = {a − 2C, a, a + 2C, . . . , b, b + 2C}.

Then

$| \mathcal{A} | = \frac{b + 2C - ( a - 2C )}{2C} + 1 = 2^{d}.$

Next, use the second principle of induction to show that u generated by (2.1) satisfies ∥u∥_(max) ≤ C.

Induction hypothesis: if |u_(m,n)| ≤ C for all the pairs (m, n) such that m ≤ i, n ≤ j and m + n < i + j, then |u_(i,j)| ≤ C.

Base case: |u_(1,1)| = |y_(1,1) − q_(1,1)| = |y_(1,1) − Q_(𝒜)(y_(1,1))| ≤ C.

Induction step: if i = 1, then q_(i,j) = Q_(𝒜)(y_(i,j) + u_(i,j−1)). By the induction hypothesis, a − C ≤ y_(i,j) + u_(i,j−1) ≤ b + C, thus |u_(i,j)| = |y_(i,j) + u_(i,j−1) − q_(i,j)| ≤ C. The same reasoning applies when j = 1.

If i, j ≥ 2, by the induction hypothesis a − 3C ≤ y_(i,j) + u_(i,j−1) + u_(i−1,j) − u_(i−1,j−1) ≤ b + 3C. Because every point of [a − 3C, b + 3C] lies within C of an element of 𝒜, we also have |u_(i,j)| = |y_(i,j) + u_(i,j−1) + u_(i−1,j) − u_(i−1,j−1) − q_(i,j)| ≤ C.

Proposition 2.2 below shows that the stability constant

$C = \frac{b - a}{2( 2^{d} - 3 )},$

corresponding to the uniform alphabet

𝒜 = {a − 2C, a, a + 2C, . . . , b, b + 2C}

used in the above proposition, is optimal. Thus, it has been shown that for each patch of pixels comprising an image, pixel values of a given patch can be quantized using a two dimensional generalization of sigma delta modulation.
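The recursion (2.1) together with the alphabet of Proposition 2.1 can be exercised numerically. The following is a sketch under assumptions, not the disclosed hardware: the state array is zero-padded above and to the left so that the first row and column reduce to the 1D scheme, the patch is filled with pseudo-random values, and the names `proposition_alphabet` and `sigma_delta_2d` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def proposition_alphabet(a: float, b: float, d: int) -> Tuple[List[float], float]:
    """Alphabet {a-2C, a, a+2C, ..., b, b+2C} with C = (b-a) / (2(2^d - 3))."""
    c = (b - a) / (2 * (2 ** d - 3))
    return [a - 2 * c + 2 * c * k for k in range(2 ** d)], c


def sigma_delta_2d(
    y: Sequence[Sequence[float]], alphabet: Sequence[float]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Run the updates (2.1) in row-major order; return (q, u)."""
    n_rows, n_cols = len(y), len(y[0])
    # Zero padding on the top and left edges (assumed initialization).
    u = [[0.0] * (n_cols + 1) for _ in range(n_rows + 1)]
    q = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            v = u[i][j - 1] + u[i - 1][j] - u[i - 1][j - 1] + y[i - 1][j - 1]
            q[i - 1][j - 1] = min(alphabet, key=lambda c: abs(c - v))
            u[i][j] = v - q[i - 1][j - 1]
    return q, [row[1:] for row in u[1:]]


random.seed(0)
a, b, d = 0.0, 1.0, 3  # assumed pixel range and bit depth
alphabet, C = proposition_alphabet(a, b, d)
patch = [[random.uniform(a, b) for _ in range(8)] for _ in range(8)]
q, u = sigma_delta_2d(patch, alphabet)
max_state = max(abs(v) for row in u for v in row)
```

For inputs in [a, b] the largest state magnitude `max_state` should not exceed C (up to floating-point rounding), in line with the induction argument above.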

Different techniques for sequencing pixels for two dimensional sigma delta quantization are contemplated by this disclosure. One example for sequencing pixels is illustrated in FIG. 5. In this example, a sequence of pixel values is created by sequencing pixel values along antidiagonals of the matrix, starting from an upper left corner of the given patch and moving to the lower right corner of the given patch. Within each antidiagonal, the sequence may go from lower left to upper right as seen in FIG. 5. That is, the sequence of pixel values is P11, P21, P12, P31, P22, P13, P32, P23 and P33. Alternatively, within each antidiagonal, the sequence may go from upper right to lower left. That is, the sequence of pixel values is P11, P12, P21, P13, P22, P31, P23, P32 and P33. In either case, the sequence of pixels may be quantized in series by an ADC as seen in FIG. 5. In other embodiments, pixels along each antidiagonal can be quantized in parallel. It is also envisioned that pixels may be sequenced and quantized along diagonals of the matrix as well. These sequencing techniques are applied to each patch from the image.
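The antidiagonal ordering can be generated programmatically. The sketch below is illustrative only: the function name `antidiagonal_order`, the zero-based (row, column) indexing, and the `lower_left_first` flag are assumptions introduced for the example.

```python
from typing import List, Tuple


def antidiagonal_order(
    n_rows: int, n_cols: int, lower_left_first: bool = True
) -> List[Tuple[int, int]]:
    """Visit a patch along antidiagonals, from the upper left to the lower right corner."""
    order: List[Tuple[int, int]] = []
    for s in range(n_rows + n_cols - 1):  # s = row + column is constant on an antidiagonal
        # Listed with increasing row index, i.e. from upper right to lower left.
        cells = [(i, s - i) for i in range(n_rows) if 0 <= s - i < n_cols]
        if lower_left_first:
            cells.reverse()
        order.extend(cells)
    return order


# Label the pixels of a 3x3 patch as P<row><column>, using one-based indices.
labels = [f"P{i + 1}{j + 1}" for i, j in antidiagonal_order(3, 3)]
alt_labels = [f"P{i + 1}{j + 1}" for i, j in antidiagonal_order(3, 3, lower_left_first=False)]
```

With either direction, the pixels to the left, above, and diagonally above-left of a given pixel lie on earlier antidiagonals, so the state values needed by the update (2.1) are already available when that pixel is reached. This is also why pixels within one antidiagonal can be processed in parallel.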

Proposition 2.2: For fixed bit depth d ≥ 2, the alphabet 𝒜 for 2D ΣΔ quantization given in Proposition 2.1 is optimal, in the sense that if C̃ is the stability constant of any other d-bit alphabet Ã (not necessarily equal-spaced), then necessarily C̃ ≥ C.

To prove this by contradiction, assume there exists a d-bit alphabet Ã whose stability constant is smaller, i.e., C̃ < C. Let the alphabet Ã be c₁ < c₂ < . . . < c_(N), with N = 2^(d). Assume c₁ < . . . < c_(i) < a ≤ c_(i+1) < . . . < c_(j) ≤ b < c_(j+1) < . . . < c_(N). Note that there is no restriction on the range of the alphabet, that is, it is possible that a ≤ c₁ or b ≥ c_(N). Also, denote the largest interval length in the alphabet within [a, b] as 2Ĩ = max{2I, b − c_(j)}, where 2I = max{c_(i+2) − c_(i+1), . . . , c_(j) − c_(j−1)}. The case d ≥ 3 is easier than d = 2, so the case d ≥ 3 is proved first.

For d ≥ 3, start by proving that there are at least two elements in the alphabet that are within [a, b], so that I is well defined. Notice that

$C = \frac{b - a}{2( 2^{d} - 3 )} \leq \frac{b - a}{10}.$

If there is zero or only one element c_(ℓ) of the alphabet between a and b, i.e., at most one c_(ℓ) satisfies a ≤ c_(ℓ) ≤ b, then one can choose a ≤ y_(1,1) ≤ b properly such that

$| y_{1,1} - Q_{A}( y_{1,1} ) | \geq \frac{b - a}{4} > C > \tilde{C},$

which leads to a contradiction.

Next, consider the following cases. In the first case, a ≤ c₁ or b ≥ c_(N). Under this assumption one of the following two cases must hold: 1) a ≤ c₁ and c₁ − a ≥ b − c_(N), or 2) b − c_(N) > c₁ − a. A closer look indicates that these two cases are exactly the same upon exchanging the roles of a and b, hence they share the same proof. Without loss of generality, assume case 1 holds: a ≤ c₁ and c₁ − a ≥ b − c_(N). Next, specify the following sub-cases:

(a) c_(N) ≤ b. Let [c_(ℓ), c_(ℓ+1)] be the largest interval in Ã for some ℓ, i.e., c_(ℓ+1) − c_(ℓ) = 2I. Choose

$y_{1,1} = \frac{c_{\ell} + c_{\ell + 1}}{2} - \epsilon, \quad y_{1,2} = y_{2,1} = c_{\ell} + 2\epsilon$

with small enough ϵ, such as 10⁻¹⁰(C − C̃), and y_(2,2) = a. This leads to u_(1,1) = I − ϵ and u_(1,2) = u_(2,1) = −I + ϵ, so that q_(2,2) = Q_(A)(y_(2,2) + u_(1,2) + u_(2,1) − u_(1,1)) = Q_(A)(a − 3I + 3ϵ), and the quantization error is

$\begin{aligned} | u_{2,2} | &= | y_{2,2} + u_{1,2} + u_{2,1} - u_{1,1} - q_{2,2} | \\ &= | Q_{A}( a - 3I + 3\epsilon ) - ( a - 3I + 3\epsilon ) | \\ &= c_{1} - a + 3I - 3\epsilon \\ &\geq c_{1} - a + \frac{3}{2} \cdot \frac{c_{N} - c_{1}}{2^{d} - 1} - 3\epsilon \\ &\geq c_{1} - a + \frac{3}{2} \cdot \frac{b - a - 2( c_{1} - a )}{2^{d} - 1} - 3\epsilon \\ &\geq \frac{3( b - a )}{2( 2^{d} - 1 )} - 3\epsilon \\ &\geq \frac{b - a}{2( 2^{d} - 3 )} - 3\epsilon \\ &= C - 3\epsilon > \tilde{C}. \end{aligned}$

The second inequality uses the assumption c₁ − a ≥ b − c_(N), and the third one uses c₁ − a ≥ 0 and d ≥ 3. This contradicts the assumption ∥u∥_(max) ≤ C̃.

(b) c_(N) > b and c_(j+1) − c_(j) < 3(b − c_(j)). If 2Ĩ = max{2I, b − c_(j)} = b − c_(j), let

$y_{1,1} = c_{j} + \tilde{I} - \epsilon, \quad y_{1,2} = y_{2,1} = \frac{c_{j + 1} + c_{j}}{2} - \tilde{I} + 2\epsilon \leq b$

with sufficiently small ϵ as in (a) and y_(2,2) = a, then

$u_{1,1} = \tilde{I} - \epsilon, \quad u_{1,2} = u_{2,1} = - \frac{c_{j + 1} - c_{j}}{2} + \epsilon < - \tilde{I} + \epsilon.$

If 2Ĩ = 2I, one can choose y_(1,1), y_(1,2), y_(2,1) as in (a) such that u_(1,1) = Ĩ − ϵ, u_(1,2) = u_(2,1) = −Ĩ + ϵ. In both cases, let y_(2,2) = a. Then y_(2,2) + u_(1,2) + u_(2,1) − u_(1,1) ≤ a − 3Ĩ + 3ϵ < c₁, so q_(2,2) = Q_(A)(y_(2,2) + u_(1,2) + u_(2,1) − u_(1,1)) = c₁, and the quantization error at q_(2,2) is

$\begin{aligned} | u_{2,2} | &= | c_{1} - ( y_{2,2} + u_{1,2} + u_{2,1} - u_{1,1} ) | \\ &\geq c_{1} - a + 3\tilde{I} - 3\epsilon \\ &\geq c_{1} - a + \frac{3}{2} \cdot \frac{b - c_{1}}{2^{d} - 1} - 3\epsilon \\ &\geq c_{1} - a + \frac{3}{2} \cdot \frac{b - a - ( c_{1} - a )}{2^{d} - 1} - 3\epsilon \\ &\geq \frac{3( b - a )}{2( 2^{d} - 1 )} - 3\epsilon \\ &\geq \frac{b - a}{2( 2^{d} - 3 )} - 3\epsilon \\ &= C - 3\epsilon > \tilde{C}. \end{aligned}$

This leads to a contradiction.

(c) c_(N) > b and c_(j+1) − c_(j) ≥ 3(b − c_(j)). If one chooses

$y_{1,1} = \frac{b + c_{j}}{2} - \epsilon$

with some small ϵ and y_(1,2) = b, then

$u_{1,1} = \frac{b - c_{j}}{2} - \epsilon, \quad u_{1,2} = \frac{3}{2}( b - c_{j} ) - \epsilon \leq \tilde{C}.$

Since this holds for arbitrarily small ϵ, one must have b − c_(j) ≤ ⅔C̃. This gives b − ⅔C̃ ≤ c_(j) ≤ b and

$2I \geq \frac{c_{j} - c_{1}}{j - 1} \geq \frac{b - \frac{2}{3}\tilde{C} - c_{1}}{2^{d} - 2},$

where the last inequality is due to the assumption c_(N) > b, so that j ≤ N − 1. As in (a), one can choose y_(1,1), y_(1,2), y_(2,1) properly and y_(2,2) = a, such that u_(1,1) = I − ϵ, u_(1,2) = u_(2,1) = −I + ϵ, provided that ϵ is small enough. Then the quantization error at q_(2,2) = Q_(A)(a − 3I + 3ϵ) = c₁ is

$\begin{aligned} | u_{2,2} | &= | Q_{A}( a - 3I + 3\epsilon ) - ( a - 3I + 3\epsilon ) | \\ &= c_{1} - a + 3I - 3\epsilon \\ &\geq c_{1} - a + \frac{3}{2} \cdot \frac{b - \frac{2}{3}\tilde{C} - c_{1}}{2^{d} - 2} - 3\epsilon \\ &\geq c_{1} - a + \frac{3}{2} \cdot \frac{b - a - \frac{b - a}{3( 2^{d} - 3 )} - ( c_{1} - a )}{2^{d} - 2} - 3\epsilon \\ &\geq \frac{3}{2} \cdot \frac{b - a - \frac{b - a}{3( 2^{d} - 3 )}}{2^{d} - 2} - 3\epsilon \\ &\geq \frac{b - a}{2( 2^{d} - 3 )} - 3\epsilon \\ &= C - 3\epsilon > \tilde{C}. \end{aligned}$

This also leads to a contradiction.

In the next case, a > c₁ and b < c_(N). If a > c₂ and b < c_(N−1), then one can easily choose a proper a ≤ y_(1,1) ≤ b with quantization error at least

$\max\{ 2I,\ c_{i + 1} - a,\ b - c_{j} \} \geq \frac{b - a}{2( 2^{d} - 4 + 1 )} = C > \tilde{C},$

which leads to a contradiction. Therefore, without loss of generality, assume c₁ < a ≤ c₂ and c₂ − a ≥ b − c_(N−1). As above, specify the following sub-cases:

(d) c_(N−1) ≤ b. For an arbitrary constant a − 3I < ξ < c₂, one can choose y_(1,1), y_(1,2), y_(2,1), y_(2,2) properly such that ξ = y_(2,2) + u_(1,2) + u_(2,1) − u_(1,1). If a ≤ ξ < c₂, set y_(1,1) = y_(1,2) = y_(2,1) = c₂ and y_(2,2) = ξ, then u_(1,1) = u_(1,2) = u_(2,1) = 0 and y_(2,2) + u_(1,2) + u_(2,1) − u_(1,1) = ξ. If a − 3I < ξ < a, denote

$w = \frac{a - \xi}{3},$

then 0 < w < I. Let [c_(ℓ), c_(ℓ+1)] be the largest interval in Ã for some ℓ, i.e., c_(ℓ+1) − c_(ℓ) = 2I. Choose y_(1,1) = c_(ℓ) + w, y_(1,2) = y_(2,1) = c_(ℓ+1) − 2w and y_(2,2) = a, then u_(1,1) = w and u_(1,2) = u_(2,1) = −w, so one has y_(2,2) + u_(1,2) + u_(2,1) − u_(1,1) = a − 3w = ξ. Hence whatever c₁ is, one can always obtain the quantization error

$\begin{aligned} \max_{a - 3I < \xi < c_{2}} | \xi - Q_{A}( \xi ) | &\geq \frac{1}{3}( c_{2} - ( a - 3I ) ) \\ &= \frac{1}{3}( c_{2} - a ) + I \\ &\geq \frac{1}{3}( c_{2} - a ) + \frac{1}{2} \cdot \frac{c_{N - 1} - c_{2}}{2^{d} - 3} \\ &\geq \frac{1}{3}( c_{2} - a ) + \frac{1}{2} \cdot \frac{b - a - 2( c_{2} - a )}{2^{d} - 3} \\ &\geq \frac{b - a}{2( 2^{d} - 3 )} + ( \frac{1}{3} - \frac{1}{2^{d} - 3} )( c_{2} - a ) \\ &\geq \frac{b - a}{2( 2^{d} - 3 )} \\ &= C > \tilde{C}. \end{aligned}$

This leads to a contradiction.

(e) c_(N−1)>b and c_(j+1)−c_(j)≤3/2(b−c_(j)). As in (d), first show that one can choose y_(1,1), y_(1,2), y_(2,1), y_(2,2) properly to make y_(2,2)+u_(1,2)+u_(2,1)−u_(1,1) an arbitrary constant between a−3Ĩ and c₂. If 2Ĩ=2I, the reasoning is the same as in (d); here the case 2Ĩ=b−c_(j) is discussed. If a≤ξ<c₂, let y_(1,1)=y_(1,2)=y_(2,1)=c_(j), y_(2,2)=ξ; then y_(2,2)+u_(1,2)+u_(2,1)−u_(1,1)=ξ. If a−3Ĩ<ξ<a, denote w=a−ξ, so that 0<w<3Ĩ, and consider the following sub-cases: i) if 0<w≤Ĩ, let y_(1,1)=c_(j)+w, y_(1,2)=y_(2,1)=c_(j)−w, y_(2,2)=a; then u_(1,1)=w, u_(1,2)=u_(2,1)=0, and y_(2,2)+u_(1,2)+u_(2,1)−u_(1,1)=a−w=ξ; ii) if Ĩ<w≤2Ĩ, let y_(1,1)=c_(j)+w−Ĩ, y_(1,2)=c_(j+1)−w, y_(2,1)=c_(j)−(w−Ĩ) and y_(2,2)=a; then u_(1,1)=w−Ĩ, u_(1,2)=−Ĩ, u_(2,1)=0, and y_(2,2)+u_(1,2)+u_(2,1)−u_(1,1)=a−w=ξ; iii) if 2Ĩ<w<3Ĩ, let y_(1,1)=c_(j)+w−2Ĩ, y_(1,2)=y_(2,1)=c_(j+1)−w+Ĩ and y_(2,2)=a; then u_(1,1)=w−2Ĩ, u_(1,2)=u_(2,1)=−Ĩ, and y_(2,2)+u_(1,2)+u_(2,1)−u_(1,1)=a−w=ξ.

Therefore, whatever c₁ is, the worst case quantization error can reach

$\begin{matrix}{{\max\limits_{{a - {3\overset{\sim}{I}}} < \xi < c_{2}}{{\xi - {Q_{A}(\xi)}}}} = {\frac{1}{3}( {c_{2} - ( {a - {3\overset{\sim}{I}}} )} )}} \\{= {{\frac{1}{3}( {c_{2} - a} )} + \overset{\sim}{I}}} \\{\geq {{\frac{1}{3}( {c_{2} - a} )} + {\frac{1}{2} \cdot \frac{b - c_{2}}{2^{d} - 3}}}} \\{\geq {{\frac{1}{3}( {c_{2} - a} )} + {\frac{1}{2} \cdot \frac{b - a - ( {c_{2} - a} )}{2^{d} - 3}}}} \\{\geq {\frac{b - a}{2( {2^{d} - 3} )} + {( {\frac{1}{3} - \frac{1}{2( {2^{d} - 3} )}} )( {c_{2} - a} )}}} \\{\geq \frac{b - a}{2( {2^{d} - 3} )}} \\{= {C > {\overset{\sim}{C}.}}}\end{matrix}$

This leads to a contradiction.

(f) c_(N−1)>b and c_(j+1)−c_(j)>3/2(b−c_(j)). As in (c), one has j≤N−2 and b−c_(j)≤4/3 C̃, then

$\begin{matrix}{{\max\limits_{{a - {3I}} < \xi < c_{2}}{{\xi - {Q_{A}(\xi)}}}} = {\frac{1}{3}( {c_{2} - ( {a - {3I}} )} )}} \\{= {{\frac{1}{3}( {c_{2} - a} )} + I}} \\{\geq {{\frac{1}{3}( {c_{2} - a} )} + {\frac{1}{2} \cdot \frac{b - {\frac{4}{3}\overset{\sim}{C}} - c_{2}}{2^{d} - 4}}}} \\{\geq {{\frac{1}{3}( {c_{2} - a} )} + {\frac{1}{2} \cdot \frac{b - a - \frac{2( {b - a} )}{3( {2^{d} - 3} )} - ( {c_{2} - a} )}{2^{d} - 4}}}} \\{\geq {\frac{b - a}{2( {2^{d} - 3} )} + {( {\frac{1}{3} - \frac{1}{2( {2^{d} - 4} )}} )( {c_{2} - a} )}}} \\{\geq \frac{b - a}{2( {2^{d} - 3} )}} \\{= {C > {\overset{\sim}{C}.}}}\end{matrix}$

This also leads to a contradiction.

For the case d=2, there are only 4 elements in the alphabet Ã={c₁, c₂, c₃, c₄} with c₁<c₂<c₃<c₄. Consider the case a≤c₁ or b≥c₄. If there are at least two elements in Ã that are within [a, b], the proof follows the same reasoning as for d≥3, which has been discussed in (a)-(c). Here the case is discussed in which a≤c₁ and there is only one element of Ã within [a, b], i.e., a≤c₁≤b<c₂<c₃<c₄. In this case, let

$y_{1,1} = \frac{c_{1} + b}{2},\quad y_{1,2} = y_{2,1} = y_{2,2} = a,$

then

$u_{1,1} = \frac{b - c_{1}}{2},\quad u_{1,2} = u_{2,1} = a - c_{1},$

$u_{2,2} = a + 2(a - c_{1}) - \frac{b - c_{1}}{2} - Q_{A}\left( a + 2(a - c_{1}) - \frac{b - c_{1}}{2} \right) = 2(a - c_{1}) - \frac{b - c_{1}}{2} \leq \frac{a - c_{1}}{2} - \frac{b - c_{1}}{2} = -C,$

hence

$|u_{2,2}| \geq \frac{b - a}{2} = C > \tilde{C},$

which leads to a contradiction.

Next, the two remaining cases are discussed, in which both a>c₁ and b<c₄ hold: c₁<a≤c₂≤b<c₃<c₄ and c₁<a≤c₂<c₃≤b<c₄, which correspond to the cases in which there are 1 or 2 elements of the alphabet between a and b, respectively.

For c₁<a≤c₂<c₃≤b<c₄, without loss of generality, assume c₂+c₃≥b+a. Note that one must have c₂−c₁<b−a since

${{\overset{\sim}{C} < C} = \frac{b - a}{2}},$

combining these two inequalities gives c₁+c₃>2a, so a<½(c₁+c₃)<b. Choose some small ϵ and the first 3×3 entries of y as

$y = {\begin{pmatrix}{{\frac{1}{2}( {c_{3} + c_{2}} )} - \epsilon} & c_{2} & {c_{2} + {2\epsilon}} & \ldots \\c_{2} & c_{2} & {\frac{1}{2}( {c_{1} + c_{3}} )} & \ldots \\{c_{2} + {2\;\epsilon}} & {\frac{1}{2}( {c_{1} + c_{3}} )} & a & \ldots \\\ldots & \ldots & \ldots & \ldots\end{pmatrix}.}$

Provided that ϵ is small enough, one can check that the first 3×3 entries in u are as follows

$u = \begin{pmatrix}{{\frac{1}{2}( {c_{3} - c_{2}} )} - \epsilon} & {{\frac{1}{2}( {c_{3} - c_{2}} )} - \epsilon} & {{{- \frac{1}{2}}( {c_{3} - c_{2}} )} + \epsilon} & \cdots \\{{\frac{1}{2}( {c_{3} - c_{2}} )} - \epsilon} & {{\frac{1}{2}( {c_{3} - c_{2}} )} - \epsilon} & {{{- \frac{1}{2}}( {c_{2} - c_{1}} )} + \epsilon} & \cdots \\{{{- \frac{1}{2}}( {c_{3} - c_{2}} )} + \epsilon} & {{{- \frac{1}{2}}( {c_{2} - c_{1}} )} + \epsilon} & {a - {\frac{1}{2}( {c_{3} + c_{2}} )} + {3\epsilon}} & \cdots \\\cdots & \cdots & \cdots & \cdots\end{pmatrix}$

By assumption, one has c₂+c₃≥b+a, then for small enough ϵ,

${{{u_{3,3}} \geq {\frac{b - a}{2} - {3\epsilon}}} = {{C - {3\epsilon}} > \overset{\sim}{C}}},$

this leads to a contradiction.

For c₁<a≤c₂<b≤c₃<c₄, specify the following two cases:

(i)

${c_{2} \geq \frac{b + a}{2}},$

notice that c₂−c₁<b−a, so c₁+c₂>2a. One can choose y_(1,1)=c₂, y_(1,2)=y_(2,1)=½(c₁+c₂)+ϵ for sufficiently small ϵ and y_(2,2)=a; then u_(1,1)=0, u_(1,2)=u_(2,1)=−½(c₂−c₁)+ϵ and Q_(A)(y_(2,2)+u_(1,2)+u_(2,1)−u_(1,1))=Q_(A)(a−(c₂−c₁)+2ϵ)=c₁ and

${{{y_{2,2} + u_{1,2} + u_{2,1} - u_{1,1} - {Q_{A}( {y_{2,2} + u_{1,2} + u_{2,1} - u_{1,1}} )}}} = {{c_{2} - a - {2\epsilon}} \geq {\frac{b - a}{2} - {2\epsilon}} > \overset{\sim}{C}}},$

which leads to a contradiction.

(ii)

${c_{2} < \frac{b + a}{2}},$

also notice that c₃−c₂<b−a, then one can choose y as follows with some small ϵ,

$y = \begin{pmatrix}{{\frac{1}{2}( {b + c_{2}} )} - \epsilon} & c_{2} & {c_{2} + {\frac{1}{2}( {c_{3} - b} )} + {2\epsilon}} & \cdots \\c_{2} & c_{2} & {\frac{1}{2}( {c_{1} + c_{3}} )} & \cdots \\{c_{2} + {\frac{1}{2}( {c_{3} - b} )} + {2\epsilon}} & {\frac{1}{2}( {c_{1} + c_{3}} )} & a & \cdots \\\cdots & \cdots & \cdots & \cdots\end{pmatrix}$

Provided that ϵ is small enough, the corresponding u is

$u = \begin{pmatrix}{\frac{1}{2}(b - c_{2}) - \epsilon} & {\frac{1}{2}(b - c_{2}) - \epsilon} & {-\frac{1}{2}(c_{3} - c_{2}) + \epsilon} & \cdots \\{\frac{1}{2}(b - c_{2}) - \epsilon} & {\frac{1}{2}(b - c_{2}) - \epsilon} & {-\frac{1}{2}(c_{2} - c_{1}) + \epsilon} & \cdots \\{-\frac{1}{2}(c_{3} - c_{2}) + \epsilon} & {-\frac{1}{2}(c_{2} - c_{1}) + \epsilon} & {-\frac{c_{2} + b}{2} + a + 3\epsilon} & \cdots \\\cdots & \cdots & \cdots & \cdots\end{pmatrix}$

Since c₂≥a, then

${{u_{3,3}} = {{\frac{c_{2} + b}{2} - a - {3\epsilon}} \geq {\frac{b - a}{2} - {3\epsilon}} > \overset{\sim}{C}}},$

which leads to a contradiction.

The quantization time of the 2D scheme (2.1) is O(N), because for a fixed t∈{2,3, . . . , 2N}, all u_(i,j) with i+j=t (the points on an anti-diagonal) can be computed in parallel. The matrix representation of (2.1) is

y−q=DuD ^(T).
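The anti-diagonal sweep and the matrix identity above can be illustrated with a minimal sketch. It assumes the first-order recursion implied by y−q=DuD^(T), namely q_(i,j)=Q(y_(i,j)+u_(i−1,j)+u_(i,j−1)−u_(i−1,j−1)), zero state outside the image, 0-based indices, and an unbounded uniform alphabet δℤ (an assumption made here so the quantizer never saturates). The names `quantize_2d` and `max_residual` are hypothetical.

```python
import random


def quantize_2d(y, delta):
    """First-order 2D Sigma Delta sketch; returns (q, u) as nested lists."""
    n_rows, n_cols = len(y), len(y[0])
    q = [[0.0] * n_cols for _ in range(n_rows)]
    u = [[0.0] * n_cols for _ in range(n_rows)]

    def state(i, j):
        # State variables outside the image are taken to be zero.
        return u[i][j] if i >= 0 and j >= 0 else 0.0

    # Entries on one anti-diagonal (i + j = t) depend only on earlier
    # anti-diagonals, so the inner loop could be executed in parallel.
    for t in range(n_rows + n_cols - 1):
        for i in range(max(0, t - n_cols + 1), min(n_rows, t + 1)):
            j = t - i
            v = y[i][j] + state(i - 1, j) + state(i, j - 1) - state(i - 1, j - 1)
            q[i][j] = delta * round(v / delta)  # nearest point of delta*Z (assumed alphabet)
            u[i][j] = v - q[i][j]
    return q, u


def max_residual(y, q, u):
    """Largest entry of |(y - q) - D u D^T|, with (D u D^T)_{ij} written out."""
    def s(a, b):
        return u[a][b] if a >= 0 and b >= 0 else 0.0

    worst = 0.0
    for i in range(len(y)):
        for j in range(len(y[0])):
            dudt = s(i, j) - s(i - 1, j) - s(i, j - 1) + s(i - 1, j - 1)
            worst = max(worst, abs(y[i][j] - q[i][j] - dudt))
    return worst


random.seed(0)
y = [[random.random() for _ in range(5)] for _ in range(6)]
q, u = quantize_2d(y, delta=0.25)
print("max |(y - q) - D u D^T| =", max_residual(y, q, u))
print("max |u| =", max(abs(v) for row in u for v in row))
```

With the unbounded alphabet the state stays within δ/2 in magnitude; a finite alphabet would need the stability analysis discussed earlier in the document.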

The first-order quantization extends readily to higher orders. For r≥1, the r-th order quantization obeys the matrix-form recursion

y−q=D ^(r) u(D ^(r))^(T).

Throughout, it has been assumed that the image X to be quantized and reconstructed is an N×N matrix (the derivation for rectangular matrices is similar), X=(x₁, x₂, . . . , x_(N))=(x₁, x₂, . . . , x_(N))^(T) is the column-wise and row-wise decomposition of X, D is the N×N difference matrix with 1s on the diagonal and −1s on the sub-diagonal, and D₁ is the circulant difference matrix with an extra −1 in the upper right corner. Denote ∥·∥₁ as the entry-wise ℓ₁-norm and ∥·∥_(∞) as the entry-wise ℓ_(∞)-norm. Also, 𝓕_(k) denotes the discrete Fourier transform operator with frequency k, and F is the N×N DFT matrix. Let F_(L) contain the rows of F with frequencies within {−L, −L+1, . . . , L}, and P_(L)=F_(L)^(*)F_(L).

A general assumption is that the images satisfy some sparsity property in their gradients. To be more precise, consider three classes of images, each satisfying one of the following three assumptions.

Assumption 2.1 (β^(th) order sparsity condition) Suppose X∈ℝ^(N,N) is an image whose columns or rows are piece-wise constant or piece-wise linear. Explicitly, the cardinality of the 1^(st) order or 2^(nd) order differences in each column or row is smaller than the number of pixels: for β=1 or 2, fix s<N,

∥(D^(β))^(T) x_(i)∥₀≤s, ∀i=1,2, . . . , N, or ∥x_(j)^(T) D^(β)∥₀≤s, ∀j=1,2, . . . , N.

If β=1, the columns or rows of the image X are piece-wise constant; if β=2, they are piece-wise linear.
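As a small illustration of the sparsity condition, the sketch below counts the nonzero entries of (D^(β))^(T)x for a single column. It assumes D has 1s on the diagonal and −1s on the sub-diagonal, so that (D^(T)x)_(n)=x_(n)−x_(n+1) with the last entry equal to x_(N); the helper names `diff_T` and `sparsity` and the two sample columns are made up for the example.

```python
def diff_T(x):
    """Apply D^T: entry n is x[n] - x[n+1]; the last entry is x[-1] itself."""
    n = len(x)
    return [x[i] - (x[i + 1] if i + 1 < n else 0.0) for i in range(n)]


def sparsity(x, beta, tol=1e-12):
    """Number of nonzero entries of (D^beta)^T x."""
    d = list(x)
    for _ in range(beta):
        d = diff_T(d)
    return sum(1 for v in d if abs(v) > tol)


piecewise_constant = [2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0]
piecewise_linear = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]

print(sparsity(piecewise_constant, beta=1))  # 3: two jumps plus the boundary entry
print(sparsity(piecewise_linear, beta=1))    # 8: first differences are dense
print(sparsity(piecewise_linear, beta=2))    # 2: one kink plus the boundary entry
```

The boundary entry is counted because D^(T) is not circulant; the minimum separation setting later in the text uses the circulant matrix D₁ instead.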

Assumption 2.2 Both columns and rows in X are piece-wise constant or piece-wise linear. Explicitly, for β=1 or 2, fix s<2N²,

∥(D ^(β))^(T) X∥₀+∥XD ^(β)∥₀ ≤s.

Assumption 2.3 (β^(th) order minimum separation condition) X satisfies Assumption 2.1. In addition, the β^(th) order differences of X in each column or row satisfy the Λ_(M)-minimum separation condition defined below for some small constant M≪N. Explicitly, this means that for β=1 or 2, {D₁^(β) x_(i)}_(i=1,2, . . . , N) or {x_(j)^(T)(D₁^(β))^(T)}_(j=1,2, . . . , N) satisfy the Λ_(M)-minimum separation condition. Note that here D₁ is the circulant difference matrix.

Definition 2.1 (Λ_(M)-minimum separation) For a vector x∈ℝ^(N), let S⊂{1,2, . . . , N} be its support set. The vector is said to satisfy the Λ_(M)-minimum separation condition if

$\begin{matrix}{\min\limits_{s,{s^{\prime} \in S},{s \neq s^{\prime}}}\frac{1}{N}\left| s - s^{\prime} \right| \geq \frac{2}{M},} & (2.2)\end{matrix}$

where |·| is the wrap-around distance. Denote by C(T, Λ_(M)) the space of trigonometric polynomials of degree M on the set T, i.e.,

C(T, Λ_(M))={f∈C^(∞)(T): f(x)=Σ_(k=−M)^(M) a_(k) e^(i2πkx)}.
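The separation condition (2.2) can be checked directly on a support set. The sketch below is one possible reading, assuming 0-based indices and a wrap-around distance measured on the index grid modulo N; the function names are hypothetical. The comparison is done in integers (min gap · M ≥ 2N) to avoid floating-point ties.

```python
def wrap_distance(s, t, n):
    """Wrap-around distance between indices s and t on a grid of n points."""
    d = abs(s - t) % n
    return min(d, n - d)


def satisfies_min_separation(support, n, m):
    """True if min wrap-around distance / n >= 2 / m over distinct support points."""
    pts = sorted(set(support))
    gaps = [wrap_distance(s, t, n) for a, s in enumerate(pts) for t in pts[a + 1:]]
    if not gaps:
        return True  # zero or one support point: nothing to separate
    return min(gaps) * m >= 2 * n


support = [5, 40, 95]  # the closest pair is 95 and 5, at wrap-around distance 10
print(satisfies_min_separation(support, n=100, m=20))  # True: 10/100 >= 2/20
print(satisfies_min_separation(support, n=100, m=10))  # False: 10/100 < 2/10
```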

The proposed decoders and their error bounds.

Let Q be the encoder Q_(col) or Q_(2D), which will be specified in each case, and let X be the image. The proposed decoders for images satisfying different assumptions can be unified in the following framework:

$\begin{matrix}{\hat{X} = \arg\min_{Z} f(Z,\beta)\quad\text{s.t.}\quad\rho(Z,r) \leq c.} & (2.3)\end{matrix}$

Here β is 1 or 2 depending on whether the image is assumed to be piece-wise constant or piece-wise linear, f(Z,β) is a loss function that encourages sparsity in the gradient under the various assumptions, r is the order of the Sigma Delta quantization, and ρ(Z,r)≤c is the feasibility constraint determined by the quantization scheme. Under this framework, if X̂ is a solution of (2.3), one can obtain reconstruction error bounds of the following type:

$\begin{matrix}{\|\hat{X} - X\|_{F} \leq C(\beta,r,N,\delta),} & (2.4)\end{matrix}$

where N is the size of the image, and δ is the alphabet step-size.

Now specify the explicit form of the optimization framework and error bound for each class of images.

Class 1: X satisfies Assumption 2.1 with order β=1 or 2 and sparsity s, and the encoder is r^(th) order (r≥β) Q_(col) with an alphabet step-size δ. Use the following optimization for reconstruction:

$\begin{matrix}{\hat{X} = \arg\min_{Z}\|(D^{\beta})^{T}Z\|_{1}\quad\text{s.t.}\quad\|D^{-r}(Z - Q_{col}(X))\|_{\infty} \leq \delta/2.} & (2.5)\end{matrix}$

Theorem 3.1 shows that the reconstruction error is

$\|\hat{X} - X\|_{F} \leq C\sqrt{sN}\delta.$

Class 2: X satisfies Assumption 2.2 with order β=1 and sparsity s, and the encoder is Q_(2D) with alphabet step-size δ. Use the following optimization:

$\begin{matrix}{\hat{X} = \arg\min_{Z}\|D^{T}Z\|_{1} + \|ZD\|_{1}\quad\text{s.t.}\quad\|D^{-1}(Z - Q_{2D}(X))D^{-T}\|_{\infty} \leq \delta/2.} & (2.6)\end{matrix}$

Theorem 3.3 shows that the reconstruction error is bounded by

$\|\hat{X} - X\|_{F} \leq C\sqrt{s}\delta.$

Class 3: X satisfies Assumption 2.3 with order β=1 or 2 and sparsity s≪N, and the encoder is r^(th) order Sigma Delta quantization applied to each column, Q_(col), with alphabet spacing δ and r≥β. Here a new alphabet Ã with smaller step-size

$\tilde{\delta} := \frac{2\delta}{(2N)^{r}}$

is defined to quantize the last r entries in each column:

$\tilde{A} := \{ a, a + \tilde{\delta}, a + 2\tilde{\delta}, \ldots, a + K\tilde{\delta}, b \},\quad K = \max\{ j : a + j\tilde{\delta} < b \}.$

The total number of boundary bits is of order O(log N), which is negligible compared with the O(N) bits needed for the interior pixels. Hence the following feasibility constraint holds:

$\left\| D^{-r}(X - Q_{col}(X))_{N - r:N - 1,:} \right\|_{\infty} \leq \left( \frac{1}{2N} \right)^{r}\delta.$

Then the following optimization is used to obtain the reconstructed image X̃:

$\begin{matrix}{\tilde{X} = \arg\min_{Z}\|D_{1}^{\beta}Z\|_{1}\quad\text{subject to}\quad\left\{ \begin{matrix}{\|D^{-r}(Z - Q_{col}(X))\|_{\infty} \leq \frac{\delta}{2},} \\{\left\| D^{-r}(Z - Q_{col}(X))_{N - r:N - 1,:} \right\|_{\infty} \leq \left( \frac{1}{2N} \right)^{r}\delta.}\end{matrix} \right.} & (2.7)\end{matrix}$

Here D^(−r)(Z−Q_(col)(X))_(N−r:N−1,:) refers to the last r rows of D^(−r)(Z−Q_(col)(X)). The error bound is

$\|\tilde{X} - X\|_{F} \leq C\frac{M^{r + \beta - 2}}{N^{r - 3}}\delta.$

In MSQ, the worst-case quantization error for each pixel is δ/2. Since the pixels are quantized independently, the total quantization error of the N×N image in Frobenius norm is Nδ/2. Similarly, when using ΣΔ quantizers (Q_(col) or Q_(2D)) and decoding with the following naive decoder,

$\hat{X} = \text{Find } Z\quad\text{s.t.}\quad\|D^{-r}(Z - Q_{col}(X))\|_{\infty} \leq \delta/2,$

the worst-case error is again O(Nδ). This indicates that the TV norm penalty in the proposed decoders (2.5), (2.7) and (2.6) plays a key role in reducing the error, to O(√(sN)δ) for (2.5) and O(√sδ) for (2.6).

First, consider images with no minimum separation (Class 1), where the image X satisfies Assumption 2.1 and the encoder is Q_(col), column-by-column quantization. With this encoder, the decoder (2.5) can be decoupled into columns, with the reconstructions done in parallel.

For each column x∈ℝ^(N), let q be its r^(th) order Sigma Delta quantization, i.e., q=Q^(ΣΔ,r)(x), r≥β, β=1,2. Then the decoder (2.5) reduces to

$\begin{matrix}{\hat{x} = \arg\min_{z}\|(D^{\beta})^{T}z\|_{1}\quad\text{subject to}\quad\|D^{-r}(z - q)\|_{\infty} \leq \delta/2.} & (3.1)\end{matrix}$

Here D is the finite difference matrix and δ is the quantization step-size. Therefore (D^(β))^(T)z represents the 1^(st) order or 2^(nd) order discrete derivatives of z. The ℓ₁-norm is used to promote the sparsity of the derivatives corresponding to edges. The ball constraint is a well-known feasibility constraint for Sigma Delta quantization.
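The feasibility constraint in (3.1) can be made concrete with a short sketch of a column quantizer. It assumes the standard greedy r-th order rule obtained from x−q=D^(r)u (solve for u_(n) and round to the nearest alphabet point), zero initial state, and an unbounded uniform alphabet δℤ; these are illustrative assumptions, and `sigma_delta` and `inv_D_power` are hypothetical names. Under them the true column x is itself feasible, because D^(−r)(x−q)=u and every |u_(n)|≤δ/2.

```python
from math import comb, sin


def sigma_delta(x, r, delta):
    """Greedy r-th order Sigma Delta sketch for one column; returns (q, u)."""
    n = len(x)
    q = [0.0] * n
    u = [0.0] * n
    for i in range(n):
        # (D^r u)_i = sum_{k=0}^{r} (-1)^k C(r,k) u_{i-k}; move the k >= 1 terms over.
        past = sum((-1) ** (k - 1) * comb(r, k) * u[i - k]
                   for k in range(1, r + 1) if i - k >= 0)
        v = x[i] + past
        q[i] = delta * round(v / delta)  # nearest point of delta*Z (assumed alphabet)
        u[i] = v - q[i]
    return q, u


def inv_D_power(z, r):
    """Apply D^{-r}: D^{-1} is the lower-triangular all-ones matrix (cumulative sum)."""
    out = list(z)
    for _ in range(r):
        running = 0.0
        for i, v in enumerate(out):
            running += v
            out[i] = running
    return out


delta, r = 0.1, 2
x = [0.5 + 0.4 * sin(0.3 * n) for n in range(40)]
q, u = sigma_delta(x, r, delta)
state = inv_D_power([a - b for a, b in zip(x, q)], r)
print("max |D^-r (x - q)| =", max(abs(v) for v in state), " delta/2 =", delta / 2)
```

Solving (3.1) itself additionally requires a linear-program or convex solver, which is outside the scope of this sketch.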

The following theorem provides the error bound of this decoder.

Theorem 3.1 For first order or second order Sigma Delta quantization, i.e., β=1 or 2, r≥β, assume the support of (D^(β))^(T)x has cardinality s, and x̂ is a solution to (3.1). Then

$\begin{matrix}{\|\hat{x} - x\|_{2} \leq C\sqrt{s}\delta.} & (3.2)\end{matrix}$

Remark 3.2 The above error bound is for each column. Putting the errors of all columns together as X̂, one has

$\|\hat{X} - X\|_{F} \leq C\sqrt{sN}\delta.$

Denote h=(D^(β))^(T)(x̂−x), and let S, with cardinality s, be the support set of (D^(β))^(T)x, with complement S^(c). Since x̂ is a solution to (3.1), one has

$\|(D^{\beta})^{T}x\|_{1} \geq \|(D^{\beta})^{T}x + h\|_{1} \geq \|(D^{\beta})^{T}x\|_{1} - \|h_{S}\|_{1} + \|h_{S^{c}}\|_{1},$

which gives ∥h_(S)∥₁≥∥h_(S^c)∥₁. One can therefore bound the ℓ₁-norm of the misfit h as

$\|h\|_{1} = \|h_{S}\|_{1} + \|h_{S^{c}}\|_{1} \leq 2\|h_{S}\|_{1} \leq 2^{\beta + r + 1}s\delta,$

where the last inequality is due to

$\|h\|_{\infty} = \|(D^{\beta})^{T}(\hat{x} - x)\|_{\infty} = \|(D^{\beta})^{T}D^{r}D^{-r}(\hat{x} - x)\|_{\infty} \leq 2^{\beta + r}\delta.$

Then the following properties hold:

$\|(D^{\beta})^{T}(\hat{x} - x)\|_{1} \leq 2^{\beta + r + 1}s\delta,\quad\|D^{-\beta}(\hat{x} - x)\|_{\infty} = \|D^{r - \beta}D^{-r}(\hat{x} - x)\|_{\infty} \leq 2^{r - \beta}\delta.$

Note that the two inequalities above are stated in the ℓ₁-norm and the ℓ_(∞)-norm, which are dual to each other. One can therefore bound the reconstruction error ∥x̂−x∥₂ using

$\langle\hat{x} - x,\hat{x} - x\rangle = \langle(D^{\beta})^{T}(\hat{x} - x),D^{-\beta}(\hat{x} - x)\rangle \leq 2^{2r + 1}s\delta^{2}.$

This is equivalent to ∥x̂−x∥₂≤C√sδ.
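The duality step at the end of this proof rests on the identity ⟨e,e⟩=⟨D^(T)e, D^(−1)e⟩ followed by Hölder's inequality. A quick numerical sanity check for β=1 is sketched below, assuming D^(T) acts as e_(n)−e_(n+1) (last entry e_(N)) and D^(−1) as a cumulative sum; the vector e is random test data, not an actual reconstruction error.

```python
import random


def diff_T(e):
    """Apply D^T: entry n is e[n] - e[n+1]; the last entry is e[-1]."""
    n = len(e)
    return [e[i] - (e[i + 1] if i + 1 < n else 0.0) for i in range(n)]


def cumsum(e):
    """Apply D^{-1}: running sums of e."""
    out, running = [], 0.0
    for v in e:
        running += v
        out.append(running)
    return out


def duality_terms(e):
    """Return (<e,e>, <D^T e, D^{-1} e>, ||D^T e||_1 * ||D^{-1} e||_inf)."""
    g, c = diff_T(e), cumsum(e)
    energy = sum(v * v for v in e)
    inner = sum(a * b for a, b in zip(g, c))
    holder = sum(abs(a) for a in g) * max(abs(b) for b in c)
    return energy, inner, holder


random.seed(0)
e = [random.uniform(-1.0, 1.0) for _ in range(32)]
energy, inner, holder = duality_terms(e)
print(energy, inner, holder)  # the first two agree; the third is an upper bound
```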

Second, consider Class 2, where the image X satisfies Assumption 2.2 and the encoder is Q_(2D). For simplicity, assume the patch number is 1 (there is only one patch, identical to the original image). Results for larger patch numbers are similar. The following theorem establishes the error bound for 2D reconstruction of X from its quantization Q_(2D)(X) using (2.6).

Theorem 3.3 If the original matrix X satisfies Assumption 2.2 and X̂ is a solution to (2.6), then

$\begin{matrix}{\|\hat{X} - X\|_{F} \leq C\sqrt{s}\delta.} & (3.3)\end{matrix}$

Denote H₁=D^(T)(X̂−X) and H₂=(X̂−X)D. Let S_(A) and S_(B) be the support sets of D^(T)X and XD, respectively, with corresponding complement sets S_(A)^(C) and S_(B)^(C). By assumption, |S_(A)|+|S_(B)|≤s. Also notice that

$\|D^{T}X\|_{1} + \|XD\|_{1} \geq \|D^{T}\hat{X}\|_{1} + \|\hat{X}D\|_{1} = \|D^{T}X + H_{1}\|_{1} + \|XD + H_{2}\|_{1} \geq \|D^{T}X\|_{1} - \|(H_{1})_{S_{A}}\|_{1} + \|(H_{1})_{S_{A}^{C}}\|_{1} + \|XD\|_{1} - \|(H_{2})_{S_{B}}\|_{1} + \|(H_{2})_{S_{B}^{C}}\|_{1},$

which gives ∥(H₁)_(S_A^C)∥₁+∥(H₂)_(S_B^C)∥₁≤∥(H₁)_(S_A)∥₁+∥(H₂)_(S_B)∥₁, hence

$\|H_{1}\|_{1} + \|H_{2}\|_{1} \leq 2\left( \|(H_{1})_{S_{A}}\|_{1} + \|(H_{2})_{S_{B}}\|_{1} \right) \leq Cs\delta.$

Here the last inequality is due to ∥H₁∥_(∞)=∥D^(T)D(D^(−1)(X̂−X)D^(−T))D^(T)∥_(∞)≤8δ and, similarly, ∥H₂∥_(∞)≤8δ. Then one has the following constraints:

$\|D^{T}(\hat{X} - X)\|_{1} \leq Cs\delta,\quad\|D^{-1}(\hat{X} - X)\|_{\infty} \leq 2\delta.$

Similar to the proof of Theorem 3.1, the inequalities above lead to

$\|\hat{X} - X\|_{F} = \langle D^{T}(\hat{X} - X),D^{-1}(\hat{X} - X)\rangle^{\frac{1}{2}} \leq C\sqrt{s}\delta,$

so (3.3) also holds.

Third, consider reconstruction of images with the minimum separation condition (Class 3), where the image X satisfies Assumption 2.3. As in Class 1, use Q_(col) (column-by-column quantization) for encoding.

For x∈ℝ^(N), q=Q^(ΣΔ,r)(x), β=1 or 2, r≥β, denote by v the last r entries of D^(−r)(x−q). Then (2.7) reduces to

$\begin{matrix}{\hat{x} = \arg\min_{z}\|D_{1}^{\beta}z\|_{1}\quad\text{s.t.}\quad\|D^{-r}(z - q)\|_{\infty} \leq \delta/2,\quad\left\| (D^{-r}(z - q))_{N - r:N - 1} - v \right\|_{\infty} < \left( \frac{1}{2N} \right)^{r}\delta.} & (3.4)\end{matrix}$

There are two differences between this decoder and that for Class 1: 1) here D₁ is the circulant difference matrix instead of the forward difference matrix, which ensures that the separation condition is satisfied at the boundary; and 2) in order for the extra separation assumption to improve the error bound over Class 1, one needs to use a few more bits to encode the boundary pixels. The total number of boundary bits is of order O(log N), which is negligible compared with the O(N) bits needed for the interior pixels.

Theorem 3.4 For high order ΣΔ quantization, i.e., r≥2, assume D₁^(β)x satisfies the Λ_(M)-minimum separation condition, and x̂ is a solution to (3.4). Then for arbitrary resolution L≤N/2, the following error bound holds:

$\begin{matrix}{\|P_{L}(\hat{x} - x)\|_{\infty} \leq C\frac{L^{2}}{N^{r}}M^{r + \beta - 2}\delta.} & (3.5)\end{matrix}$

Here P_(L) is the projection onto the low frequency domain with bandwidth L, i.e., P_(L)=F_(L)^(*)F_(L) with F_(L) being the first L rows of the DFT matrix.

Again, the image X is sliced into columns that are reconstructed individually. If X satisfies Assumption 2.3 and X̂ is the reconstructed image concatenated from the individual columns, each a solution to (3.4), the infinity norm error bound for decoder (2.7) is then

$\|P_{L}(\hat{X} - X)\|_{\infty} \leq C\frac{L^{2}}{N^{r}}M^{r + \beta - 2}\delta,$

where ∥·∥_(∞) is the element-wise ℓ_(∞) norm. Substituting L=N/2, one obtains

$\|\hat{X} - X\|_{F} \leq C\frac{M^{r + \beta - 2}}{N^{r - 3}}\delta.$

Consider using decoder (3.4) only when the gradients of each column satisfy the minimum separation condition, i.e., for all i=1,2, . . . , N, D₁^(β)x_(i)∈ℝ^(N) satisfies the minimum separation condition with some small constant M≪N. In this case, the worst-case ℓ_(∞)-norm error bound for arbitrary resolution approaches 0 as r→∞.

In order to prove Theorem 3.4, super-resolution analysis is performed within the Sigma Delta reconstruction, and the analysis is adjusted to fit a discrete setting. First, one needs the following lemma.

Lemma 3.7 For feasible x̂∈ℝ^(N) which satisfies the constraints in (3.4), the following inequality holds:

$\|P_{M}D_{1}^{\beta}(\hat{x} - x)\|_{2} \lesssim \left( \frac{M}{N} \right)^{r + \beta}\sqrt{N}\delta.$

Recall that for z∈ℝ^(N), the discrete Fourier transform is

$\mathcal{F}_{k}z = \sum\limits_{n = 0}^{N - 1}z_{n}e^{-i2\pi\frac{kn}{N}}.$

For nonzero frequency k≠0, denote

$\alpha = \frac{1}{1 - e^{-i2\pi\frac{k}{N}}},$

then one has

$\mathcal{F}_{k}D_{1}z = \sum\limits_{n = 0}^{N - 1}(D_{1}z)_{n}e^{-i2\pi\frac{kn}{N}} = \left( 1 - e^{-i2\pi\frac{k}{N}} \right)\mathcal{F}_{k}z = \alpha^{-1}\mathcal{F}_{k}z.$

Also,

$\mathcal{F}_{k}D^{-1}z = \sum\limits_{n = 0}^{N - 1}\left( \sum\limits_{j = 0}^{n}z_{j} \right)e^{-i2\pi\frac{kn}{N}} = \sum\limits_{n = 0}^{N - 1}z_{n}\sum\limits_{j = n}^{N - 1}e^{-i2\pi\frac{kj}{N}} = \frac{1}{1 - e^{-i2\pi\frac{k}{N}}}\sum\limits_{n = 0}^{N - 1}z_{n}\left( e^{-i2\pi\frac{kn}{N}} - 1 \right) = \alpha\mathcal{F}_{k}z - \alpha(D^{-1}z)_{N - 1}.$
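The two Fourier identities above are easy to confirm numerically. The sketch below assumes 0-based indexing, D₁ acting as z_(n)−z_(n−1) with wrap-around at n=0, and D^(−1) acting as a cumulative sum; `dft`, `circ_diff` and `cumsum` are hypothetical helper names.

```python
import cmath
import random


def dft(z, k):
    """F_k z = sum_n z_n exp(-i 2 pi k n / N)."""
    n = len(z)
    return sum(z[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))


def circ_diff(z):
    """D_1 z: z_n - z_{n-1}, where z[-1] wraps around to the last entry."""
    return [z[m] - z[m - 1] for m in range(len(z))]


def cumsum(z):
    """D^{-1} z: running sums of z."""
    out, running = [], 0.0
    for v in z:
        running += v
        out.append(running)
    return out


random.seed(0)
N = 12
z = [random.uniform(-1.0, 1.0) for _ in range(N)]
for k in range(1, N):
    alpha = 1.0 / (1.0 - cmath.exp(-2j * cmath.pi * k / N))
    err1 = abs(dft(circ_diff(z), k) - dft(z, k) / alpha)
    err2 = abs(dft(cumsum(z), k) - (alpha * dft(z, k) - alpha * cumsum(z)[-1]))
    print(k, err1, err2)  # both errors are at floating-point round-off level
```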

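Similarly,

$\mathcal{F}_{k}D^{-2}z = \alpha\mathcal{F}_{k}D^{-1}z - \alpha(D^{-2}z)_{N - 1} = \alpha^{2}\mathcal{F}_{k}z - \alpha^{2}(D^{-1}z)_{N - 1} - \alpha(D^{-2}z)_{N - 1},$

$\mathcal{F}_{k}D^{-3}z = \alpha^{3}\mathcal{F}_{k}z - \alpha^{3}(D^{-1}z)_{N - 1} - \alpha^{2}(D^{-2}z)_{N - 1} - \alpha(D^{-3}z)_{N - 1}.$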

More generally, for β=1 or 2, r≥2,

$\begin{matrix}{{\mathcal{F}_{k}D^{- r}z} = {{\alpha^{r}\mathcal{F}_{k}z} - {\alpha^{r}( {D^{- 1}z} )}_{N - 1} - {\alpha^{r - 1}( {D^{- 2}z} )}_{N - 1} - \cdots - {\alpha( {D^{- r}z} )}_{N - 1}}} \\{= {{\alpha^{r + \beta}\mathcal{F}_{k}D_{1}^{\beta}z} - {\alpha^{r}( {D^{- 1}z} )}_{N - 1} - {\alpha^{r - 1}( {D^{- 2}z} )}_{N - 1} - \cdots - {{\alpha( {D^{- r}z} )}_{N - 1}.}}}\end{matrix}$

Multiplying both sides by α^(−(r+β)) and rearranging the terms gives

${\mathcal{F}_{k}D_{1}^{\beta}z} = {{( {1 - e^{{- i}2\pi\frac{k}{N}}} )^{r + \beta}\mathcal{F}_{k}D^{- r}z} + {( {1 - e^{{- i}2\pi\frac{k}{N}}} )^{\beta}( {D^{- 1}z} )_{N - 1}} + {( {1 - e^{{- i}2\pi\frac{k}{N}}} )^{\beta + 1}( {D^{- 2}z} )_{N - 1}} + \cdots + {( {1 - e^{{- i}2\pi\frac{k}{N}}} )^{\beta + r - 1}{( {D^{- r}z} )_{N - 1}.}}}$

Note that for k=0, 𝓕₀D₁^(β)z=Σ_(n=0)^(N−1)(D₁^(β)z)_(n)=0. Then the equation above holds for all z∈ℝ^(N) and integer k with 0≤|k|≤N/2. Denote by Λ∈ℂ^(2M+1,2M+1) the diagonal matrix with diagonal entries

${1 - e^{{- i}2\pi\frac{k}{N}}},{{- M} \leq k \leq M},$

to obtain the matrix form of the equations above:

$F_{M}D_{1}^{\beta}z = \Lambda^{r + \beta}F_{M}D^{-r}z + \Lambda^{\beta}(D^{-1}z)_{N - 1}\mathbf{1} + \Lambda^{\beta + 1}(D^{-2}z)_{N - 1}\mathbf{1} + \cdots + \Lambda^{\beta + r - 1}(D^{-r}z)_{N - 1}\mathbf{1} = \Lambda^{r + \beta}F_{M}D^{-r}z + \sum\limits_{\ell = 1}^{r}\Lambda^{\beta + \ell - 1}(D^{-\ell}z)_{N - 1}\mathbf{1}.$

Recall that F_(M) contains the rows of the DFT matrix with frequencies within {−M, −M+1, . . . , M}. Multiplying both sides by

$\frac{1}{N}F_{M}^{*}$

and replacing z with x̂−x gives

${P_{M}{D_{1}^{\beta}( {\hat{x} - x} )}} = {{\frac{1}{N}F_{M}^{*}\Lambda^{r + \beta}F_{M}{D^{- r}( {\hat{x} - x} )}} + {\sum\limits_{\ell = 1}^{r}{\frac{1}{N}F_{M}^{*}{{\Lambda^{\beta + \ell - 1}( {D^{- \ell}( {\hat{x} - x} )} )}_{N - 1}.}}}}$

Note that

$\left| 1 - e^{-i2\pi\frac{k}{N}} \right| \leq 2\pi\frac{|k|}{N} \leq 2\pi\frac{M}{N},\quad\text{hence}\quad\left\| \frac{1}{N}F_{M}^{*}\Lambda^{\ell} \right\|_{2} \lesssim \frac{1}{\sqrt{N}}\left( \frac{M}{N} \right)^{\ell},$

so the following error bound in the ℓ₂ norm holds:

$\|P_{M}D_{1}^{\beta}(\hat{x} - x)\|_{2} \leq \left\| \frac{1}{N}F_{M}^{*}\Lambda^{r + \beta}F_{M} \right\|_{2}\|D^{-r}(\hat{x} - x)\|_{2} + \sum\limits_{\ell = 1}^{r}\left\| \frac{1}{N}F_{M}^{*}\Lambda^{\beta + \ell - 1} \right\|_{2}\sqrt{M}\left| (D^{-\ell}(\hat{x} - x))_{N - 1} \right| \lesssim \left( \frac{M}{N} \right)^{r + \beta}\sqrt{N}\delta + \sum\limits_{\ell = 1}^{r}\left( \frac{M}{N} \right)^{\beta + \ell - \frac{1}{2}} \cdot 2^{r}\left( \frac{1}{2N} \right)^{r}\delta \lesssim \left( \frac{M}{N} \right)^{r + \beta}\sqrt{N}\delta.$

Therefore the low frequency error P_(M)D₁^(β)(x̂−x) decreases with the ΣΔ quantization order r. Denote h=D₁^(β)(x̂−x), and divide h into two parts based on whether the location of each entry is within a neighborhood of some support point of D₁^(β)x.

For simplicity of proof, one can view the vectors x, x̂, h∈ℝ^(N) as signals on [0,1] sampled at the grid t_(n)=n/N, n=0,1, . . . , N−1. For D₁^(β)x satisfying the Λ_(M)-minimum separation condition with support set S={ξ₁,ξ₂, . . . , ξ_(s)}⊂[0,1], define

$S_{M}(j) = \{ x \in [0,1] : |x - \xi_{j}| \leq 0.16M^{-1} \},\quad j = 1,2,\ldots,s,$

and

$S_{M} = \bigcup\limits_{j = 1}^{s}S_{M}(j),\quad S_{M}^{c} = [0,1]\backslash S_{M}.$

Then the following lemma holds.

Lemma 3.8 If D₁^(β)x satisfies the Λ_(M)-minimum separation condition, then with the definitions above, there exists a constant C>0 such that the following hold:

$\begin{matrix}{\sum\limits_{t_{n} \in S_{M}^{c}}|h_{n}| \leq C\sqrt{N}\|P_{M}h\|_{2},} & (3.6)\end{matrix}$

$\begin{matrix}{\sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}|h_{n}||t_{n} - \xi_{j}|^{2} \leq CM^{-2}\sqrt{N}\|P_{M}h\|_{2}.} & (3.7)\end{matrix}$

Denote the restriction of h to a set S as P_(S)h, and P_(S)h(ξ_(j))=|P_(S)h(ξ_(j))|e^(iϕ_j), j=1,2, . . . , s. By Lemma 6.1, taking v_(j)=e^(iϕ_j), j=1,2, . . . , s, there exist f(t)=Σ_(k=−M)^(M) c_(k)e^(i2πkt) defined on [0,1] and constants C₁, C₂ such that

f(t _(j))=e ^(iϕj) , j=1,2. . . s,  (3.8)

|f(t)|≤1−C ₁ M ²(t−ξ_(j))² ,t∈S _(M)(j),  (3.9)

|f(t)|<1−C ₂ ,t∈S _(M) ^(c).  (3.10)

Denote f_(n)=f(t_(n)) where t_(n)=n/N, n=0,1, . . . , N−1, then

${\sum\limits_{t_{n} \in S}{h_{n}}} = {{{{\sum\limits_{t_{n} \in S}{{\overset{\_}{f}}_{n}h_{n}}}} \leq {{{\sum\limits_{n = 0}^{N - 1}{{\overset{\_}{f}}_{n}h_{n}}}} + {{\sum\limits_{t_{n} \in S_{M}^{c}}{{\overset{\_}{f}}_{n}h_{n}}}} + {{\sum_{j}{\sum\limits_{t_{n} \in {{S_{M}{(j)}}\backslash{\{ s_{j}\}}}}{{\overset{\_}{f}}_{n}h_{n}}}}}} \leq {{{\sum\limits_{n = 0}^{N - 1}{{\overset{\_}{f}}_{n}h_{n}}}} + {( {1 - C_{2}} ){\sum\limits_{t_{n} \in S_{M}^{c}}{h_{n}}}} + {\sum\limits_{j}{\sum\limits_{t_{n} \in {S_{M}{(j)}}}{( {1 - {C_{1}M^{2}\;( {t_{n} - \xi_{j}} )^{2}}} ){h_{n}}}}}}} = {{{\sum\limits_{n = 0}^{N - 1}{{\overset{\_}{f}}_{n}h_{n}}}} + {\sum\limits_{t_{n} \in S^{c}}{h_{n}}} - {C_{2}{\sum\limits_{t_{n} \in S_{M}^{c}}{h_{n}}}} - {C_{1}M^{2}{\sum_{j}{\sum\limits_{t_{n} \in {S_{M}{(j)}}}{( {t_{n} - \xi_{j}} )^{2}{h_{n}}}}}}}}$

Rearrange the inequality, one obtains

$\begin{matrix}{{{C_{2}{\sum\limits_{t_{n} \in S_{M}^{c}}{h_{n}}}} + {C_{1}M^{2}{\sum\limits_{j}{\sum\limits_{t_{n} \in {S_{M}{(j)}}}{( {t_{n} - \xi_{j}} )^{2}{h_{n}}}}}}} \leq {{{\sum\limits_{n = 0}^{N - 1}{{\overset{\_}{f}}_{n}h_{n}}}} + {\sum\limits_{t_{n} \in S^{c}}{h_{n}}} - {\sum\limits_{t_{n} \in S}{{h_{n}}.}}}} & (3.11)\end{matrix}$

Note that |Σ_(n=0) ^(N−1) f_(n)h_(n)|=|<f,h>|=|<f,P_(M)h>|≤∥f∥₂∥P_(M)h∥₂≤√{square root over(N)}∥P_(M)h∥₂, also note that {circumflex over (x)} is a solution of(3.4), so

${{{{D_{1}^{\beta}x}}_{1} \geq {{D_{1}^{\beta}\overset{\hat{}}{x}}}_{1}} = {{{{D_{1}^{\beta}x} + h}}_{1} \geq {{\sum\limits_{t_{n} \in S}{( {D_{1}^{\beta}x} )_{n}}} - {\sum\limits_{t_{n} \in S}{h_{n}}} + {\sum\limits_{t_{n} \in S^{c}}{h_{n}}}}}},\mspace{20mu}{then}$$\mspace{20mu}{{{\sum\limits_{t_{n} \in S^{c}}{h_{n}}} - {\sum\limits_{t_{n} \in S}{h_{n}}}} \leq 0.}$

then (3.11) becomes

C ₂Σ_(t) _(n) _(∈S) _(M) _(c) |h _(n) |+C ₁ M ²Σ_(j)Σ_(t) _(n) _(∈S)_(M) _((j))(t _(n)−ξ_(j))² |h _(n) |≤√{square root over (N)}∥P _(M) h∥₂.

From this inequality we can derive (3.6) and (3.7).

Next, bound ∥K*D₁^(β)(x̂−x)∥_(∞) for an arbitrary kernel K with period 1. For arbitrary

$x_{0} \in \left\{ 0,\frac{1}{N},\ldots,\frac{N - 1}{N} \right\},$

$\begin{matrix}{|K*h(x_{0})| = \left| \sum\limits_{n = 0}^{N - 1}K(x_{0} - t_{n})h_{n} \right| \leq \left| \sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}K(x_{0} - t_{n})h_{n} \right| + \|K\|_{\infty}\sum\limits_{t_{n} \in S_{M}^{c}}|h_{n}|.} & (3.12)\end{matrix}$

On the interval S_(M)(j), approximate K(x₀−t_(n)) with its first-order Taylor expansion around x₀−ξ_(j):

$K(x_{0} - t_{n}) = K(x_{0} - \xi_{j}) + K^{\prime}(x_{0} - \xi_{j})(\xi_{j} - t_{n}) + \frac{1}{2}K^{\prime\prime}(\mu_{n})|t_{n} - \xi_{j}|^{2},\quad t_{n} \in S_{M}(j),$

with some μ_(n)∈S_(M)(j) depending on x₀, ξ_(j) and t_(n). Inserting this into (3.12), one obtains

$|K*h(x_{0})| \leq \left| \sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}\left( K(x_{0} - \xi_{j}) - K^{\prime}(x_{0} - \xi_{j})(t_{n} - \xi_{j}) \right)h_{n} \right| + \|K^{\prime\prime}\|_{\infty}\sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}|t_{n} - \xi_{j}|^{2}|h_{n}| + \|K\|_{\infty}\sum\limits_{t_{n} \in S_{M}^{c}}|h_{n}|.$

To bound the first term on the right hand side, use an interpolation argument. Let a,b∈ℂ^(s) be such that a_(j)=K(x₀−ξ_(j)), b_(j)=−K′(x₀−ξ_(j)). By Proposition 2.4 in [15], there exists a function f∈C([0,1], Λ_(M)) such that

$\|f\|_{\infty} \lesssim \|K\|_{\infty} + M^{-1}\|K^{\prime}\|_{\infty},\quad|f(x) - a_{j} - b_{j}(x - \xi_{j})| \lesssim \left( M^{2}\|K\|_{\infty} + M\|K^{\prime}\|_{\infty} \right)|x - \xi_{j}|^{2},\quad x \in S_{M}(j),$

which gives

$\left| \sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}\left( K(x_{0} - \xi_{j}) - K^{\prime}(x_{0} - \xi_{j})(t_{n} - \xi_{j}) \right)h_{n} \right| \leq \left| \sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}\left( f(t_{n}) - K(x_{0} - \xi_{j}) + K^{\prime}(x_{0} - \xi_{j})(t_{n} - \xi_{j}) \right)h_{n} \right| + \left| \sum\limits_{t_{n} \in S_{M}}f_{n}h_{n} \right| \lesssim \left( M^{2}\|K\|_{\infty} + M\|K^{\prime}\|_{\infty} \right)\sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}|t_{n} - \xi_{j}|^{2}|h_{n}| + \left| \sum\limits_{n = 0}^{N - 1}f_{n}h_{n} \right| + \left| \sum\limits_{t_{n} \in S_{M}^{c}}f_{n}h_{n} \right|.$

Also, one obtains

$\left| \sum\limits_{t_{n} \in S_{M}^{c}}f_{n}h_{n} \right| \lesssim \left( \|K\|_{\infty} + M^{-1}\|K^{\prime}\|_{\infty} \right)\sum\limits_{t_{n} \in S_{M}^{c}}|h_{n}|,$

$\left| \sum\limits_{n = 0}^{N - 1}f_{n}h_{n} \right| \leq \|f\|_{2}\|P_{M}h\|_{2} \lesssim \left( \|K\|_{\infty} + M^{-1}\|K^{\prime}\|_{\infty} \right)\sqrt{N}\|P_{M}h\|_{2}.$

Combining these results, one obtains

$\begin{matrix}{|K*h(x_{0})| \lesssim \left( 2\|K\|_{\infty} + M^{-1}\|K^{\prime}\|_{\infty} \right)\sum\limits_{t_{n} \in S_{M}^{c}}|h_{n}| + \left( \|K\|_{\infty} + M^{-1}\|K^{\prime}\|_{\infty} \right)\sqrt{N}\|P_{M}h\|_{2} + \left( M^{2}\|K\|_{\infty} + M\|K^{\prime}\|_{\infty} + \|K^{\prime\prime}\|_{\infty} \right)\sum\limits_{j}\sum\limits_{t_{n} \in S_{M}(j)}|t_{n} - \xi_{j}|^{2}|h_{n}| \lesssim \left( \|K\|_{\infty} + M^{-1}\|K^{\prime}\|_{\infty} + M^{-2}\|K^{\prime\prime}\|_{\infty} \right)\sqrt{N}\|P_{M}h\|_{2}.} & (3.13)\end{matrix}$

Next, for arbitrary resolution L, denote

$K_{L}(x) = \frac{1}{N}\sum\limits_{k = -L,k \neq 0}^{L}e^{i2\pi kx},$

then by direct calculation one obtains

$P_{L}(\hat{x} - x) = K_{L}*(\hat{x} - x) + \frac{1}{N}\sum\limits_{n = 0}^{N - 1}(\hat{x} - x)_{n}\mathbf{1},\quad\|P_{L}(\hat{x} - x) - K_{L}*(\hat{x} - x)\|_{\infty} < \frac{1}{N^{r}}.$

Here the last inequality is due to the constraint in (3.4), which forces the absolute value of the last r rows in D^(−r)(x̂−x) to be smaller than

$2 \cdot \left( \frac{1}{2N} \right)^{r},$

so that

$\left| \sum\limits_{n = 0}^{N - 1}(\hat{x} - x)_{n} \right| = \left| (D^{-1}(\hat{x} - x))_{N - 1} \right| \leq 2^{r - 1}\left\| D^{-r}(\hat{x} - x)_{N - r:N - 1} \right\|_{\infty} \leq \left( \frac{1}{N} \right)^{r}.$

In the following, the goal is to bound the infinity norm ∥K_(L)*(x̂−x)∥_(∞). Consider TV order β=1. For K_(L)(x), construct a corresponding function K̃_(L)(x), x∈[0,1], which satisfies D₁K̃_(L)(t_(j))=K_(L)(t_(j)) for all j=0,1, . . . , N−1. Then one can bound ∥K_(L)*(x̂−x)∥_(∞) by

$\|K_{L}*(\hat{x} - x)\|_{\infty} = \|D_{1}{\tilde{K}}_{L}*(\hat{x} - x)\|_{\infty} = \|{\tilde{K}}_{L}*D_{1}(\hat{x} - x)\|_{\infty} = \|{\tilde{K}}_{L}*h\|_{\infty},$

which can be further bounded by (3.13). Consider

${{{\overset{\sim}{K}}_{L}(x)} = {\frac{1}{N}{\sum\limits_{{k = {- L}},{k \neq 0}}^{L}\frac{1 - e^{i\; 2{{\pi k}({x + \frac{1}{N}})}}}{1 - e^{i\; 2\pi\frac{k}{N}}}}}},$

see that

${{{\overset{\sim}{K}}_{L}( t_{j} )} = {{\frac{1}{N}{\sum\limits_{{k = {- L}},{k \neq 0}}^{L}\frac{1 - e^{i2\pi k\frac{j + 1}{N}}}{1 - e^{i2\pi\frac{k}{N}}}}} = {{\frac{1}{N}{\sum\limits_{{k = {- L}},{k \neq 0}}^{L}{\sum\limits_{n = 0}^{j}e^{i2\pi k\frac{n}{N}}}}} = {\sum\limits_{n = 0}^{j}{K_{L}( t_{n} )}}}}},\mspace{20mu}{j = 0},1,\ldots\mspace{14mu},{N - 1.}$

So it satisfies the desired property

$D_1\tilde{K}_L(t_j) = \tilde{K}_L(t_j) - \tilde{K}_L(t_{j-1}) = K_L(t_j).$
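The telescoping property above can be checked numerically. The following is a minimal sketch, assuming the grid t_j = j/N and illustrative sizes N = 32, L = 8; the function names `K_L` and `K_tilde_L` are hypothetical helpers introduced only for this check.

```python
import numpy as np

def K_L(t, L, N):
    """K_L(t) = (1/N) * sum over 0 < |k| <= L of exp(i 2 pi k t)."""
    k = np.concatenate([np.arange(-L, 0), np.arange(1, L + 1)])
    return np.exp(2j * np.pi * np.outer(t, k)).sum(axis=1) / N

def K_tilde_L(t, L, N):
    """Closed form (1/N) * sum of (1 - e^{i2pi k(t+1/N)}) / (1 - e^{i2pi k/N})."""
    k = np.concatenate([np.arange(-L, 0), np.arange(1, L + 1)])
    num = 1.0 - np.exp(2j * np.pi * np.outer(t + 1.0 / N, k))
    den = 1.0 - np.exp(2j * np.pi * k / N)
    return (num / den).sum(axis=1) / N

N, L = 32, 8              # illustrative sizes with L <= N/2
t = np.arange(N) / N      # grid t_j = j/N (assumed)
K = K_L(t, L, N)
K_tilde = K_tilde_L(t, L, N)

# D_1 K~(t_j) = K~(t_j) - K~(t_{j-1}), using K~(t_{-1}) = 0
first_difference = np.diff(np.concatenate([[0.0], K_tilde]))
max_gap = np.max(np.abs(first_difference - K))
```

Up to floating-point rounding, `max_gap` should vanish, which mirrors the identity that K̃_L(t_j) is the running sum of K_L(t_n) for n = 0, …, j.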

Now show that the infinity norm of K̃_L(x) is bounded by some constant for arbitrary L≤N/2 and x∈[0,1]. Since e^(i2πkx) is 1-periodic, one has

$\sup_x \tilde{K}_L(x) = \frac{1}{N}\sup_x\sum_{k=-L,\,k\neq 0}^{L}\frac{1 - e^{i2\pi kx}}{1 - e^{i2\pi\frac{k}{N}}} = \frac{1}{N}\sup_x\sum_{k=1}^{L}\left(\frac{1 - \cos(2\pi kx) - i\sin(2\pi kx)}{1 - \cos(2\pi\frac{k}{N}) - i\sin(2\pi\frac{k}{N})} + \frac{1 - \cos(2\pi kx) + i\sin(2\pi kx)}{1 - \cos(2\pi\frac{k}{N}) + i\sin(2\pi\frac{k}{N})}\right) = \frac{1}{N}\sup_x\sum_{k=1}^{L}\frac{(1 - \cos(2\pi kx))(1 - \cos(2\pi\frac{k}{N})) + \sin(2\pi kx)\sin(2\pi\frac{k}{N})}{1 - \cos(2\pi\frac{k}{N})} = \frac{1}{N}\sup_x\sum_{k=1}^{L}\left((1 - \cos(2\pi kx)) + \frac{\sin(2\pi kx)\cos(\pi\frac{k}{N})}{\sin(\pi\frac{k}{N})}\right) = \frac{L}{N} + \frac{1}{N}\sup_x\sum_{k=1}^{L}\frac{\sin(2\pi kx)\cos(\pi\frac{k}{N}) - \sin(\pi\frac{k}{N})\cos(2\pi kx)}{\sin(\pi\frac{k}{N})} = \frac{L}{N} + \frac{1}{N}\sup_x\sum_{k=1}^{L}\frac{\sin(2\pi k(x - \frac{1}{2N}))}{\sin(\pi\frac{k}{N})} = \frac{L}{N} + \frac{1}{N}\sup_x\sum_{k=1}^{L}\frac{\sin(2\pi kx)}{\sin(\pi\frac{k}{N})}.$

Notice that for k≤L≤N/2,

${{\pi\frac{k}{N}} \leq \frac{\pi}{2}},$

then

$\sin( {\pi\frac{k}{N}} )$

is of the same order as

$\pi\frac{k}{N}$

since for all

$\begin{matrix}{{0 < x \leq \frac{\pi}{2}},{{x - \frac{x^{3}}{6}} \leq {\sin(x)} < x},} & \;\end{matrix}$

which further gives

${0.58\mspace{11mu}\pi\frac{k}{N}} \leq {{\pi\frac{k}{N}} - {( {\pi\frac{k}{N}} )^{3}/6}} \leq {\sin( {\pi\frac{k}{N}} )} < {\pi{\frac{k}{N}.}}$

Then see that

$\left|\frac{1}{N}\sum_{k=1}^{L}\frac{\sin(2\pi kx)}{\sin(\pi\frac{k}{N})} - \sum_{k=1}^{L}\frac{\sin(2\pi kx)}{\pi k}\right| = \left|\frac{1}{N}\sum_{k=1}^{L}\sin(2\pi kx)\left(\frac{1}{\sin(\pi\frac{k}{N})} - \frac{1}{\pi\frac{k}{N}}\right)\right| = \left|\frac{1}{N}\sum_{k=1}^{L}\sin(2\pi kx)\frac{\pi\frac{k}{N} - \sin(\pi\frac{k}{N})}{\sin(\pi\frac{k}{N})\,\pi\frac{k}{N}}\right| \leq \frac{1}{N}\sum_{k=1}^{L}\frac{\frac{1}{6}(\pi\frac{k}{N})^{3}}{0.58\,(\pi\frac{k}{N})^{2}} \leq 0.23.$

It is known that the summation

$\sum\limits_{k = 1}^{n}\frac{\sin( {2\pi\;{kx}} )}{k}$

is uniformly bounded by some constant smaller than 2 for arbitrary n∈ℕ and x∈ℝ, so K̃_L(x) is also bounded: there exists some constant C such that ∥K̃_L∥_∞≤C. Therefore by (3.13) one has,

$\|K_L * (\hat{x} - x)\|_\infty \leq \|D_1\tilde{K}_L * (\hat{x} - x)\|_\infty = \|\tilde{K}_L * h\|_\infty \leq C\frac{L^{2}}{M^{2}}\sqrt{N}\cdot\frac{M^{r+1}}{N^{r+\frac{1}{2}}}\,\delta = C\frac{L^{2}}{N}\left(\frac{M}{N}\right)^{r-1}\delta.$
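The uniform boundedness of K̃_L used in this estimate can be sanity-checked numerically. The sketch below evaluates the closed-form K̃_L on a fine grid for a few values of N with L = N/2; the grid resolution, the chosen values of N, and the helper name `sup_K_tilde` are assumptions made for illustration, and a grid maximum only approximates the true supremum.

```python
import numpy as np

def sup_K_tilde(N, L, num_points=4001):
    """Approximate the sup over x in [0, 1] of |K~_L(x)| on a uniform grid."""
    x = np.linspace(0.0, 1.0, num_points)
    k = np.concatenate([np.arange(-L, 0), np.arange(1, L + 1)])
    num = 1.0 - np.exp(2j * np.pi * np.outer(x + 1.0 / N, k))
    den = 1.0 - np.exp(2j * np.pi * k / N)
    return np.max(np.abs((num / den).sum(axis=1))) / N

# Largest admissible resolution L = N/2 for several signal lengths
sups = {N: sup_K_tilde(N, N // 2) for N in (16, 64, 256)}
```

If the derivation holds, the values in `sups` should stay below a fixed constant as N grows rather than scale with N or L.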

For the last inequality, Bernstein's Inequality for trigonometric sums is used to obtain ∥K̃_L∥_∞≤C, ∥K̃_L′∥_∞≤CL, and ∥K̃_L″∥_∞≤CL². For TV order=2, consider

${{{\overset{\sim}{K}}_{L}(x)} = {{- \frac{1}{N}}{\sum\limits_{{k = {- L}},{k \neq 0}}^{L}\frac{e^{i\; 2\pi\frac{k}{N}} - e^{i\; 2\pi\;{k({\frac{2}{N} + x})}}}{( {1 - e^{i\; 2\pi\frac{k}{N}}} )^{2}}}}},$

similarly, one can show ∥K̃_L∥_∞≤N, then

$\|K_L * (\hat{x} - x)\|_\infty \leq \|D_1^{2}\tilde{K}_L * (\hat{x} - x)\|_\infty = \|\tilde{K}_L * h\|_\infty \leq C\frac{L^{2}}{M^{2}}N^{\frac{3}{2}}\cdot\frac{M^{r+2}}{N^{r+\frac{3}{2}}}\,\delta = CL^{2}\left(\frac{M}{N}\right)^{r}\delta.$

In conclusion, for β=1 or 2, one has the ℓ_∞-norm error bound

$\|K_L * (\hat{x} - x)\|_\infty \leq C\frac{L^{2}}{N^{r}}M^{r+\beta-2}\,\delta,$

which further gives

$\|P_L(\hat{x} - x)\|_\infty \leq \|K_L * (\hat{x} - x)\|_\infty + \left\|\frac{1}{N}\sum_{n=0}^{N-1}(\hat{x} - x)_n \mathbf{1}\right\|_\infty \leq C\frac{L^{2}}{N^{r}}M^{r+\beta-2}\,\delta.$

An optimization algorithm is presented for solving (3.1). Here, the algorithm is presented for the simple case when the TV order β and the quantization order r are both 1, which reduces (3.1) to

$\begin{matrix}{\min\limits_{z}\|D^{T}z\|_{1}\quad\text{subject to}\quad\|D^{-1}(z - q)\|_{\infty} \leq \delta/2.} & (4.1)\end{matrix}$

All other optimization problems proposed in this disclosure can be solved in similar ways.
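To make the constraint in (4.1) concrete, the sketch below quantizes a synthetic column with a first-order Sigma Delta recursion and checks that the quantized sequence q is feasible for (4.1) with z equal to the original samples. It assumes that D is the first-order difference matrix (so that D⁻¹ acts as a cumulative sum) and that the alphabet is the unbounded uniform grid δℤ; both are illustrative assumptions, and `sigma_delta_1d` is a hypothetical helper name.

```python
import numpy as np

def sigma_delta_1d(x, delta):
    """First-order Sigma Delta quantization onto the grid delta * Z (assumed alphabet)."""
    q = np.empty_like(x)
    u = 0.0                                    # running state u_{n-1}
    for n, sample in enumerate(x):
        q[n] = delta * np.round((u + sample) / delta)
        u = u + sample - q[n]                  # state update u_n
    return q

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=64)   # synthetic "column" of pixel values
delta = 0.25
q = sigma_delta_1d(x, delta)

# With D the first-order difference matrix, D^{-1}(x - q) is a cumulative sum
state = np.cumsum(x - q)
max_state = np.max(np.abs(state))
```

Because each state is a rounding residual, `max_state` does not exceed δ/2 (up to rounding error), so the true signal lies inside the feasible set over which the decoder minimizes ∥D^(T)z∥₁.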

Let us start with writing out the Lagrangian dual of (4.1)

$\begin{matrix}{\mathcal{L}(z,y) = \|D^{T}z\|_{1} - \frac{\delta}{2}\|y\|_{1} + \langle y, D^{-1}(z - q)\rangle.} & (4.2)\end{matrix}$

This dual formulation (4.2) is a special case of the general form

$\begin{matrix}{{{\max\limits_{y}{\min\limits_{x}\langle {{Lx},y} \rangle}} + {g(x)} - {f^{*}(y)}},} & (4.3)\end{matrix}$

which is known to be solvable by primal-dual algorithms such as the Chambolle-Pock method (Algorithm 1).

Algorithm 1: Solve (4.3) using Chambolle-Pock Method
Initializations: τ, σ > 0, τσ < 1, θ ∈ [0,1], x₀, y₀, and set x̄₀ = x₀.
Iterations: Update x_(n), y_(n), x̄_(n) as follows:

$\begin{cases}y_{n+1} = \mathrm{Prox}_{\sigma f^{*}}(y_{n} + \sigma L\bar{x}_{n}) & (a) \\ x_{n+1} = \mathrm{Prox}_{\tau g}(x_{n} - \tau L^{*}y_{n+1}) & (b) \\ \bar{x}_{n+1} = x_{n+1} + \theta(x_{n+1} - x_{n}) & (c)\end{cases}$

A natural way to apply the Chambolle-Pock method to (4.2) is to let x=D^(T)z, L=D⁻¹D^(−T), g(x)=∥x∥₁,

$f^{*}(y) = \frac{\delta}{2}\|y\|_{1},$

then (4.2) becomes

$\begin{matrix}{\mathcal{L}(x,y) = \|x\|_{1} - \frac{\delta}{2}\|y\|_{1} + \langle y, D^{-1}D^{-T}x - D^{-1}q\rangle \equiv \langle Lx, y\rangle + g(x) - f^{*}(y).} & (4.4)\end{matrix}$

Although this can be solved by the Chambolle-Pock method (Algorithm 1), it converges slowly due to the poor conditioning of L (the condition number of L is O(N²)).
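The quadratic growth of the condition number can be observed directly. The sketch below assumes D is the N×N first-order difference matrix with 1 on the diagonal and −1 on the sub-diagonal, and uses hypothetical helper names.

```python
import numpy as np

def difference_matrix(N):
    """First-order difference matrix: 1 on the diagonal, -1 just below it (assumed form of D)."""
    return np.eye(N) - np.eye(N, k=-1)

def cond_of_L(N):
    """2-norm condition number of L = D^{-1} D^{-T}."""
    D_inv = np.linalg.inv(difference_matrix(N))
    return np.linalg.cond(D_inv @ D_inv.T)

conds = {N: cond_of_L(N) for N in (32, 64, 128)}
growth = conds[64] / conds[32]   # close to 4 if the condition number scales like N^2
```

Doubling N should roughly quadruple the condition number, which is consistent with the O(N²) scaling and motivates the change of variable described next.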

One proposed strategy is to move the large condition number from L to g in (4.3) through a change of variable. This is beneficial because, on the one hand, the improved condition number of L accelerates the convergence of the primal-dual updates; on the other hand, the increased condition number of g is harmless to the inner loop due to the use of the proximal map. Explicitly, the change of variable is x=D⁻¹(z−q), so that z=Dx+q, and the dual form becomes

$\begin{matrix}{\mathcal{L}(x,y) = \langle y, x\rangle + \|D^{T}Dx + D^{T}q\|_{1} - \frac{\delta}{2}\|y\|_{1}.} & (4.5)\end{matrix}$

Next, let g(x)=∥D^(T)Dx+D^(T)q∥₁,

$f^{*}(y) = \frac{\delta}{2}\|y\|_{1},$

one can see that (4.5) also fits into the form of (4.3). Applying the Chambolle-Pock method to (4.5) gives the following algorithm:

Algorithm 2: Solve (4.5) using Chambolle-Pock Method
Initializations: τ, σ > 0, τσ < 1, θ ∈ [0,1], x₀, y₀, and set x̄₀ = x₀.
Iterations: Update x_(n), y_(n), x̄_(n) as follows:

$\begin{cases}y_{n+1} = \arg\min_{y}\frac{\sigma\delta}{2}\|y\|_{1} + \frac{1}{2}\|y - (y_{n} + \sigma\bar{x}_{n})\|^{2} & (a) \\ x_{n+1} = \arg\min_{x}\tau\|D^{T}Dx + D^{T}q\|_{1} + \frac{1}{2}\|x - (x_{n} - \tau y_{n+1})\|^{2} & (b) \\ \bar{x}_{n+1} = x_{n+1} + \theta(x_{n+1} - x_{n}) & (c)\end{cases}$

Note that it has been shown that when L=I (I is the identity matrix), the step sizes satisfy τ=1/σ, and the extrapolation rate θ is set to 1, the primal-dual hybrid gradient (PDHG) method is equivalent to Douglas-Rachford splitting (DRS) and ADMM. Hence the above algorithm can also be derived by appropriate applications of the ADMM algorithm when those special step sizes are used.

In Algorithm 2, step (a) has a closed form solution. To solve (b), one can apply the ADMM algorithm, the procedure of which is stated in Algorithm 3.
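The closed form of step (a) is the entrywise soft-thresholding operator with threshold σδ/2, which is the standard proximal map of a scaled ℓ₁ norm. A minimal sketch follows; the function names and the sample inputs are illustrative only.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Entrywise minimizer of kappa * ||y||_1 + 0.5 * ||y - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def dual_step(y_n, x_bar_n, sigma, delta):
    """Step (a) of Algorithm 2: soft-threshold y_n + sigma * x_bar_n at sigma * delta / 2."""
    return soft_threshold(y_n + sigma * x_bar_n, sigma * delta / 2.0)

# Illustrative call: entries with magnitude below the threshold 0.5 are set to zero
y_next = dual_step(np.array([3.0, -0.2, 0.5, -2.0]), np.zeros(4), sigma=1.0, delta=1.0)
```

Entries are shrunk toward zero by the threshold, so the dual update costs only O(N) operations per iteration.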

Algorithm 3: Solve (b) in Algorithm 2 by ADMM
Initializations: ρ > 0, x₀, u₀, b₀
Iterations: Update x_(n), u_(n), b_(n) as follows:

$\begin{cases}x_{n+1} = \arg\min_{x}\frac{1}{2}\|x - (x_{n} - \tau y_{n+1})\|_{2}^{2} + \frac{\rho}{2}\|D^{T}Dx + D^{T}q - u_{n} + b_{n}\|_{2}^{2} \\ u_{n+1} = \arg\min_{u}\tau\|u\|_{1} + \frac{\rho}{2}\|D^{T}Dx_{n+1} + D^{T}q - u + b_{n}\|_{2}^{2} \\ b_{n+1} = b_{n} + D^{T}Dx_{n+1} + D^{T}q - u_{n+1}\end{cases}$
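The three updates of Algorithm 3 can be sketched as follows. The x-update is a linear solve, the u-update is a soft-thresholding, and the b-update accumulates the constraint residual. This is an illustrative sketch under stated assumptions: D is taken to be the first-order difference matrix, the vector `v` stands for the fixed point x_n − τy_{n+1} supplied by the outer loop, and the values of ρ, the iteration count, and the problem size are arbitrary choices rather than values from this disclosure.

```python
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_primal_step(v, q, tau, rho=2.0, num_iters=5000):
    """Approximately solve min_x tau*||D^T D x + D^T q||_1 + 0.5*||x - v||^2."""
    N = v.size
    D = np.eye(N) - np.eye(N, k=-1)           # assumed first-order difference matrix
    A, c = D.T @ D, D.T @ q
    # The x-update solves (I + rho A^T A) x = v + rho A^T (u - b - c); invert once.
    M_inv = np.linalg.inv(np.eye(N) + rho * A.T @ A)
    u = np.zeros(N)
    b = np.zeros(N)
    x = v.copy()
    for _ in range(num_iters):
        x = M_inv @ (v + rho * A.T @ (u - b - c))        # x-update
        u = soft_threshold(A @ x + c + b, tau / rho)     # u-update
        b = b + A @ x + c - u                            # b-update
    return x

def objective(x, v, q, tau):
    """Objective of step (b) in Algorithm 2, used only to inspect the result."""
    N = v.size
    D = np.eye(N) - np.eye(N, k=-1)
    return tau * np.abs(D.T @ D @ x + D.T @ q).sum() + 0.5 * np.sum((x - v) ** 2)

rng = np.random.default_rng(1)
v = rng.normal(size=4)       # plays the role of x_n - tau * y_{n+1}
q = rng.normal(size=4)       # stand-in for the quantized samples
tau = 0.5
x_star = admm_primal_step(v, q, tau)
```

In practice the inner loop would typically be stopped by a residual tolerance rather than a fixed iteration count, and for large N the explicit inverse would be replaced by a banded solver.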

Numerical simulations of the proposed decoder on both 1D synthetic signals and natural images are presented. An experiment is designed to confirm the benefits of the proposed quantization method over MSQ on 1D signals (representing columns of images) proved in Theorem 3.1. The signal to be quantized is piece-wise constant or piece-wise linear with random boundary locations and random height/slope on each piece, which satisfies assumption 2.1. MSQ is compared with Sigma Delta quantization coupled with the decoder (3.1). The result is displayed in FIGS. 6A-6C.

In FIG. 6A, MSQ is compared with 1st order Sigma Delta quantization, where the signal is piece-wise constant and does not satisfy the minimum separation condition. The decoder used for reconstruction is according to (3.1) with β=1. The SNR of the reconstructed signal from Sigma Delta quantization is 37.00, and the SNR of MSQ is 21.95. In FIG. 6B, MSQ is compared with the 2nd order Sigma Delta quantization and a decoder (3.1) with β=2. In this case, the SNR of the Sigma Delta reconstruction is 37.30, and the SNR of MSQ is 27.33. In FIG. 6C, the signal is piece-wise constant with random noise. MSQ is compared with the 1st order Sigma Delta quantization and decoder (3.1) with β=1. The SNR of the Sigma Delta reconstruction is 33.40, and the SNR of MSQ is 20.93.

From FIG. 6A, with the same number of bits, the reconstructed signal using 1st order ΣΔ quantization and decoder (3.1) is closer to the true signal and generally preserves the piece-wise constant structure. FIG. 6B shows that the piece-wise linear signal is also better reconstructed by the proposed method. As there is no guarantee that a given signal strictly satisfies the assumption, FIG. 6C tests the stability of the proposed method by adding random noise to the signal. It can be seen that even with additive random noise, the proposed encoder-decoder pair has reasonably good performance.

FIGS. 7A and 7B show the reconstruction results for signals that satisfy the minimum separation condition. In FIG. 7A, the signal is piece-wise constant. For Sigma Delta quantization, 1st order Sigma Delta quantization is used with a decoder (3.1) with β=r=1, as well as 3rd order Sigma Delta quantization with a decoder (3.4) with β=1, r=3. The SNRs of the reconstructed signals are 41.90 and 33.69, respectively. The SNR of MSQ is 24.13. In FIG. 7B, the signal is piece-wise linear. For Sigma Delta quantization, 2nd order Sigma Delta quantization is used with a decoder (3.1) with β=r=2, as well as 3rd order Sigma Delta quantization with a decoder (3.4) with β=2, r=3. The SNRs of the reconstructed signals are 47.02 and 36.11, respectively. The SNR of MSQ is 25.70. If the minimum separation condition is met, one can achieve further error reduction by using high order ΣΔ quantization. In both FIGS. 7A and 7B, the reconstructed signal with higher ΣΔ quantization order r is closer to the true signal than the result using the lower quantization order, which agrees with the theoretical result in Theorem 3.4.

Numerical results are also presented for 2D natural images. The best performance is usually achieved when both the TV order and the ΣΔ quantization order are 1. This setting is used in the following experiments.

In the first example, three methods are compared on gray-scale images: 2D Sigma Delta quantization Q2D coupled with decoder (2.6) (sd2D), 1D Sigma Delta quantization Qcol coupled with decoder (2.5) (sd1D), and MSQ quantization. All quantizers use the same number of bits and the optimal alphabets. In terms of visual quality, (sd2D) is better than (sd1D) and much better than MSQ. In terms of PSNR, (sd1D) is slightly better than (sd2D) and better than MSQ.
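For reference, the 2D Sigma Delta recursion recited in claim 15 can be sketched as below. The sketch assumes an unbounded uniform alphabet δℤ with rounding to the nearest member, zero state outside the patch, and serial processing from the upper left corner; the helper name `sigma_delta_2d` and the patch size are illustrative assumptions.

```python
import numpy as np

def sigma_delta_2d(y, delta):
    """2D Sigma Delta recursion with rounding onto the grid delta * Z (assumed alphabet)."""
    rows, cols = y.shape
    u = np.zeros((rows + 1, cols + 1))    # zero-padded state: u[i + 1, j + 1] holds u_{i,j}
    q = np.zeros_like(y)
    for i in range(rows):
        for j in range(cols):
            # u_{i,j-1} + u_{i-1,j} - u_{i-1,j-1} + y_{i,j}
            s = u[i + 1, j] + u[i, j + 1] - u[i, j] + y[i, j]
            q[i, j] = delta * np.round(s / delta)
            u[i + 1, j + 1] = s - q[i, j]
    return q, u[1:, 1:]

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=(16, 12))   # synthetic patch of pixel values
delta = 0.25
q, u = sigma_delta_2d(y, delta)

# Integrating the error along both axes, D^{-1} (y - q) D^{-T}, recovers the state u
integrated_error = np.cumsum(np.cumsum(y - q, axis=0), axis=1)
```

The doubly integrated quantization error coincides with the state and stays within δ/2 entrywise, which is the kind of constraint a 2D decoder can enforce during reconstruction.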

In the second experiment, the effect of dividing the image into multiple rectangular patches, and quantizing and reconstructing each patch individually, is evaluated for (sd2D) using the test image Lena. The patches can be processed in parallel, which makes the decoding process significantly faster than the single patch 2D reconstruction. There is usually no visible difference between using 1 patch and multiple patches in quantization and reconstruction. In both cases, the images look more natural and closer to the original image than MSQ, as the latter is unable to distinguish the subtle differences between pixels, especially in the face and shoulder area.

To further investigate where the improved PSNRs of the two Sigma Delta reconstructions come from, the absolute spectra of the three reconstructions are plotted, as well as the absolute spectra of the residue images. The residue image is defined as the difference between the reconstructed image and the original image. As predicted, the decoders can indeed retain the high frequency information while effectively compressing the low frequency noise.

The techniques described herein may be implemented by one or more circuits and/or computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.

Some portions of the above description present the techniques described herein in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by circuit and/or computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.

Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a circuit, a computer system, or combination thereof, that manipulates and transforms data represented as physical (electronic) quantities within the circuit and the computer system memories or registers or other such information storage, transmission or display devices.

The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.

What is claimed is:
 1. A method for quantizing pixels of an image, comprising: receiving, by a sequencing circuit, pixel values for an image captured by a camera, where the pixel values are arranged in a matrix; grouping, by the sequencing circuit, the pixel values of the matrix into columns of pixel values; and for each column in the matrix, quantizing, by an analog to digital converter, pixel values of a given column using sigma delta modulation.
 2. The method of claim 1 further comprises quantizing pixel values in a given column as a whole, thereby minimizing accumulated quantization error from a starting pixel value in the given column to a current pixel value in the given column.
 3. The method of claim 1 further comprises quantizing two or more columns of pixel values from the matrix in parallel using sigma delta modulation.
 4. The method of claim 1 further comprises quantizing two or more columns of pixel values from the matrix in series using sigma delta modulation.
 5. The method of claim 1 further comprises assembling quantized values for each column into a rectangular array and storing the rectangular array as a final image in a non-transitory computer-readable medium.
 6. The method of claim 1 further comprises reconstructing a final image from the quantized pixel values of the matrix using a decoder, where the decoder is configured to minimize an image norm during reconstruction.
 7. The method of claim 5 wherein the image norm is further defined as a total variation norm.
 8. The method of claim 1 further comprises reconstructing a final image from the quantized pixel values of the matrix using a decoder, where the decoder is configured to minimize an image norm subject to a quantization constraint and the quantization constraint is defined by a finite integration operator applied to a quantization error with an L-infinite norm bounded by half of the quantization step-size, wherein the quantization error is a vector such that each element of the vector contains a quantization error for a given pixel in the matrix.
 9. An imaging system, comprising: a camera configured to capture an image of a scene, where the image is represented by pixel values arranged in a matrix; a sequencing circuit configured to receive the image from the camera and operates to grouping the pixel values of the matrix into columns of pixel values; and an analog to digital converter interfaced with the sequencing circuit and, for each column in the matrix, quantizes pixel values of a given column using sigma delta modulation.
 10. A method for quantizing pixels of an image, comprising: receiving, by a sequencing circuit, pixel values for an image captured by an imager, where the pixel values are arranged in a matrix; segmenting, by the sequencing circuit, the pixel values of the matrix into one or more patches of pixel values, where each patch of pixel values is subset of pixel values from the matrix arranged in a two dimensional array; and for each patch in the matrix, quantizing, by an analog to digital converter, the pixel values of a given patch using a two dimensional generalization of sigma delta modulation.
 11. The method of claim 10 wherein, for each patch in the matrix, quantizing pixel values along anti-diagonals of the two dimensional array.
 12. The method of claim 11 further comprises, for each patch in the matrix, quantizing pixel values along the anti-diagonals in parallel.
 13. The method of claim 10 further comprises, for each patch in the matrix, quantizing pixels values in series starting from an upper left corner of the given patch and moving to the lower right corner of the given patch.
 14. The method of claim 11 wherein quantizing the sequence of pixel values for a given patch further includes, for a given pixel in the given patch, summing quantization errors associated with at least three pixels neighboring pixels the given pixel and rounding sum to nearest member of an alphabet.
 15. The method of claim 11 wherein quantizing the sequence of pixel values for a given patch in accordance with

q_(i,j) = Q(u_(i,j−1) + u_(i−1,j) − u_(i−1,j−1) + y_(i,j)),
u_(i,j) = u_(i,j−1) + u_(i−1,j) − u_(i−1,j−1) + y_(i,j) − q_(i,j),

where q_(i,j) is quantized value, u_(i,j) is quantization error, and y_(i,j) is pixel value.
 16. The method of claim 10 further comprises assembling quantized values for each of the one or more patches into a rectangular array and storing the rectangular array as a final image in a non-transitory computer-readable medium.
 17. The method of claim 10 further comprises reconstructing a final image from the quantized pixel values of the matrix using a decoder, where the decoder is configured to minimize an image norm during reconstruction.
 18. The method of claim 16 wherein the image norm is further defined as a total variation norm.
 19. The method of claim 10 further comprises reconstructing a final image from the quantized pixel values of the matrix using a decoder, where the decoder is configured to minimize an image norm subject to a quantization constraint and the quantization constraint is defined to ensure that a finite integration operator applied both to the left and the right of a quantization error results in a matrix with an entry-wise maximum absolute value bounded by half of the quantization step-size.
 20. A method for quantizing pixels of an image, comprising: receiving, by a sequencing circuit, pixel values for an image captured by a camera, where the pixel values are arranged in a matrix; grouping, by the sequencing circuit, the pixel values of the matrix into subsets of pixel values, where pixels of a given subset of pixels are neighbors to each other; and for each subset of pixels, quantizing, by an analog to digital converter, pixel values of a given subset of pixels using sigma delta modulation.